Abstract


In this thesis we have investigated two building blocks for distributed systems: group communication 
services and distributed consensus services. 

Using group communication services is a successful approach to developing fault-tolerant distributed
applications. Such services provide communication tools that greatly facilitate the development
of applications. Though many existing systems are used in real-world applications, there
is still a need for formal specifications of the group communication services offered by
these systems. Many researchers are making great efforts to provide such specifications. In
this thesis we have tackled this problem and have provided specifications for group communication 
services. One of our specifications considers the notion of primary view; another one generalizes 
this notion to that of primary configurations (views with quorums). Both specifications are shown 
to be implementable. The usefulness of both specifications is demonstrated by applications running 
on top of them. Our specifications are tailored to dynamic systems, where processes join and leave
the system, possibly permanently. We also showed how the approach used to develop the specifications
can be applied to transform known algorithms, designed for static settings, in order to make them
adaptable to dynamic systems.

Distributed consensus is the abstraction of many coordination problems, which are of fundamen- 
tal importance in distributed systems. Distributed consensus has been thoroughly studied and one 
important result showed that it is not possible to solve consensus in asynchronous systems if failures 
are allowed. However, in such systems it is possible to solve the k-set consensus problem, which is
a relaxed version of the consensus problem: each participating process begins the protocol with an 
input value and by the end of the protocol it must decide on one value so that at most k total values 
are decided by all correct processes (the classical consensus problem requires that there be a unique 
value decided by all correct processes). In this thesis we have investigated the k-set consensus prob- 
lem in asynchronous distributed systems. We extended previous work by exploring several variations 
of the problem definition and model, including for the first time investigation of Byzantine failures. 
We showed that the precise definition of the validity requirement, which characterizes what decision 
values are allowed as a function of the input values and whether failures occur, is crucial to the 
solvability of the problem. We introduced six validity conditions for this problem (all considered in 
various contexts in the literature), and we demarcated the line between possible and impossible for 
each case. In many cases this line is different from the one of the originally studied k-set consensus 
problem.

Chapter 1


Introduction 


In the last decade the impact of distributed systems on computing has been tremendous. Nowadays 
no single workstation is stand-alone; even personal computers in homes are connected with the rest 
of the computing world by means of the Internet. Computer interconnections can be classified at 
various levels, depending on the kind of interaction required by the components that are connected. 
The connection can be as simple as a cable connecting two computers or as complex as the Internet,
which connects millions of computers around the world. The more distributed the system,
the more complex the interconnection. Because of this, distributed systems are more subject to
failures than stand-alone computers. Also, distributed systems are harder to program because of the 
difficulties deriving from sharing data, sharing resources, and coordinating work. Thus developing 
distributed applications is a complex task. 

A popular and successful approach to managing this complexity is to decompose the system 
design, by constructing the system from pre-defined communication, synchronization, and memory 
building blocks. These building blocks may represent global (that is, system-wide) or local services; 
they may be combined in parallel or may represent different levels of abstraction. The structure 
they provide makes the systems easier to build, to use, and to modify. Examples of such building 
blocks already in use are various types of group membership and group communication services; 
failure detection, leader election, consensus, and atomic commit services; resource allocation and 
synchronization services; and various forms of strongly consistent and weakly consistent shared 
memory. 

In this thesis we investigate two such building blocks, namely, group communication services and
distributed consensus services.


1.1 Group communication services 


1.1.1 Overview and related work 


Recently, view-oriented group communication services (see [1] for a survey) have been of particular 
interest. Such a service allows application processes located at different nodes of a distributed 
network to operate collectively as a group, using the service to multicast messages to all members 
of the group. The group communication service is based on a group membership service, which 
provides each group member with a view of the group, including a list of the processes that are group 
members. Messages sent by a process in a view are delivered only to processes that are members of 
that view, and only when they have that view. Within each view, the service offers guarantees about 
the order and reliability of message delivery. Thus, a view-oriented group communication service 
manages both consistent delivery of messages within each view and the reconfiguration involved in 
changing views. Examples of such services are found in the Isis [19], Transis [6, 33], Ensemble [49], 
Newtop [38] and Relacs [14] systems as well as in other systems. Typical applications that use 
view-oriented group communication services include state-machine replication (e.g., [46, 45, 57]), 
distributed transactions and database replication (e.g., [85, 48, 56]), system management (e.g., [4]) 
and monitoring (e.g., [3]), video-on-demand servers (e.g., [10, 9]), and collaborative computing, such as
distance learning, audio and video conferences, application sharing, and even distributed musical
“jam sessions” (see [87] for more references).

In order to be most useful, group communication services (as well as other building blocks) require
clear and precise specifications of their guaranteed behavior. Such specifications would allow
application programmers to think carefully about the behavior of systems that use the primitives, 
without having to understand how the primitives themselves are implemented. Unfortunately, pro- 
viding appropriate specifications for group communication services is not an easy task. Some of 
these services are rather complicated, and there is still no agreement about exactly what the guar- 
antees should be. Different specifications arise from different implementations of the same service, 
because of differences in the safety, performance, or fault-tolerance that is provided. Moreover, the 
specifications that most accurately describe particular implementations may not be the ones that 
are easiest for application programmers to use. 

Hence providing group communication specifications is a serious challenge, and requires theo- 
retical work to support the system development work. Such work has included formal specification 
of global membership and communication services (e.g., [17, 23, 25, 27, 34, 72, 81, 83]), design and 
analysis of distributed algorithms that implement or exploit such services (e.g., [57, 7, 88]), and even 
impossibility results (e.g., [23]). 

The Isis system introduced the important concept of virtual synchrony [19]. This concept has
been interpreted in various ways, but an essential requirement is that if a particular message is
delivered to several processes, then all have the same view of the membership when the message is
delivered. This allows the recipients to take coordinated action based on the message, the member- 
ship set and the rules prescribed by the application. 
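
The essential requirement just stated can be phrased as a check over a log of deliveries. The following Python sketch is purely illustrative: the log format (triples of message, process and view identifiers) and the function name `delivered_in_same_view` are assumptions introduced here, not part of Isis or of any of the cited specifications.

```python
def delivered_in_same_view(deliveries):
    """Check that every message is delivered in a single view.

    `deliveries` is an iterable of (message_id, process_id, view_id) triples,
    a hypothetical log format chosen for this sketch.
    """
    view_of_message = {}
    for message_id, _process_id, view_id in deliveries:
        # Remember the view of the first delivery; later ones must match it.
        if view_of_message.setdefault(message_id, view_id) != view_id:
            return False
    return True


consistent_log = [("m1", "p1", "v1"), ("m1", "p2", "v1"), ("m2", "p1", "v2")]
inconsistent_log = [("m1", "p1", "v1"), ("m1", "p2", "v2")]

print(delivered_in_same_view(consistent_log))    # True
print(delivered_in_same_view(inconsistent_log))  # False
```

The check covers only the "same view at delivery" requirement; ordering and reliability guarantees within a view are outside its scope.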

The Isis system was designed for an environment where processors might fail and messages might 
be lost, but where the network does not partition. This assumption might be reasonable for some 
local area networks, but it is not valid in wide area networks. Therefore, the more recent systems 
mentioned above allow the possibility that concurrent views of the group might be disjoint. 

The first major work on the development of specifications for fault-tolerant group-oriented mem- 
bership and communication services appears to be that of Ricciardi [81], and the research area is 
still active (see, e.g., [75, 23, 88, 41]). In particular, there has been a large amount of work on devel- 
oping specifications for partitionable group services. Some specifications deal just with membership 
and views [54, 86] while others also cover message services (ordering and reliability properties) 
[72, 16, 15, 26, 34, 44, 53]. These specifications are all complicated, many are difficult to under- 
stand, and some seem to be ambiguous. It is not clear how to tell whether a specification is sufficient 
for a given application. It is not even clear how to tell whether a specification is implementable at 
all; impossibility results such as those in [23] demonstrate that this is a significant issue. Vitenberg 
et al. [87] provide a comprehensive study and comparison of several existing group communication 
specifications. 

Among previous work the specification of a group communication service provided in [41] is 
particularly relevant to the work done in this thesis. The group communication service specification 
provided in [41], called VS, captures what seems to be the basic property of a view-oriented group
communication service: processes are provided with views of the system and communication is view- 
oriented, meaning that messages sent in a particular view are delivered only within that view. We 
remark that this is not the only property of existing systems, but it is the most important one. The 
strength of the VS specification lies in its simplicity. Yet the specification is powerful enough to build
applications on top of it. 

Providing specifications that are simple enough to be implementable and strong enough to be 
usable in applications is the key for designing good building blocks. Previous work has shown that 
this is not an easy task: some specifications are too strong to be implementable, e.g. [23], and 
some of them fail to capture the non-triviality of existing group communication services. The VS
specification has been proven to be implementable and useful as a building block for more powerful
services. Lesley and Fekete [63] have proved that a version of an algorithm of Cristian and Schmuck
[27] implements the VS service. Khazan et al. [58, 59] have used VS in the design of a load-balancing
database algorithm, and in [41] a totally ordered broadcast service is built on top of VS.

We have mentioned that real distributed systems can partition. When a partition occurs the main
problem to be faced is that of maintaining consistency of replicated data. To cope with partitions,
usually the application processes perform significant computations only when they have a special
type of view called a primary view. For example, a replicated database application might only 
perform a read or write operation within a primary view, in order to ensure that each read receives 
the result of the last preceding write, in some consistent order of the operations. In this setting, a 
primary view is typically defined to be one whose membership comprises a majority of the universe 
of processes. The intersection property guaranteed by majorities permits information flow from any 
previous primary to a newly formed one. 
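
The intersection argument can be made concrete with a small Python sketch. It assumes a fixed, globally known universe of processes and represents views as plain sets; the process names and the helper `is_majority_primary` are hypothetical, introduced only for illustration.

```python
def is_majority_primary(view, universe):
    """Return True if `view` contains a strict majority of `universe`."""
    return 2 * len(view & universe) > len(universe)


# Hypothetical universe of five processes and three candidate views.
universe = frozenset({"p1", "p2", "p3", "p4", "p5"})
old_primary = frozenset({"p1", "p2", "p3"})
new_primary = frozenset({"p3", "p4", "p5"})
minority_view = frozenset({"p4", "p5"})

print(is_majority_primary(old_primary, universe))    # True
print(is_majority_primary(minority_view, universe))  # False

# Any two majorities of the same universe share at least one process,
# which can carry information from the old primary to the new one.
shared = old_primary & new_primary
print(sorted(shared))  # ['p3']
```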

This thesis focuses on primary view group communication services because many real applications 
do need to maintain consistency of replicated data. However, applications that can tolerate some 
degree of inconsistency can use partitionable group communication services. For example, in a 
shared white-board application, a partition would result in users seeing only whatever is written by 
users in their component. If and when the partition is repaired, the white-board can display
information written in each component (perhaps with some criterion to merge the white-boards
of the different components). Another example is a distributed booking system for airline tickets.
If the system partitions into two (or more) components each of them could still accept reservations, 
provided that the airline is willing to face over-booking. 

In distributed applications involving replicated data, a well known way to enhance the availability 
and efficiency of the system is to use quorums. A quorum system is a set of subsets of the members
of the system such that any two of the subsets intersect. We refer to a view with a
quorum system defined over the members of the view as a configuration. Using configurations an 
update can be performed with only a quorum available, while with an ordinary view all of the 
members must be available. The intersection property of quorums permits one to maintain data 
consistency, within a given configuration. Quorum systems have been extensively studied and used 
in applications, e.g. [2, 37, 47, 50, 79, 70, 8, 74]. The use of quorums has been proven effective also 
against Byzantine failures [68, 69]. 
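
As an illustration of the definition, the following Python sketch checks the pairwise intersection property and builds one familiar quorum system, the majorities of a member set. The function names and the three-member example are assumptions made for this sketch; they are not taken from the cited works.

```python
from itertools import combinations


def is_quorum_system(quorums):
    """Return True if every quorum is nonempty and any two quorums intersect."""
    quorums = [frozenset(q) for q in quorums]
    nonempty = all(quorums)
    pairwise = all(a & b for a, b in combinations(quorums, 2))
    return nonempty and pairwise


def majority_quorums(members):
    """Build the quorum system made of all minimal majorities of `members`."""
    size = len(members) // 2 + 1
    return [frozenset(c) for c in combinations(sorted(members), size)]


majorities = majority_quorums({"a", "b", "c"})
print(is_quorum_system(majorities))             # True
print(is_quorum_system([{"a"}, {"b", "c"}]))    # False: the two sets are disjoint

# With a configuration, an update needs only some quorum to be available,
# not every member of the view.
available = {"a", "b"}
print(any(q <= available for q in majorities))  # True
```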

Pre-defined quorum sets can yield efficient implementations in settings which are relatively static, 
i.e., failures are transient. However they work less well in settings where processes routinely join and 
leave the system, or where the system can suffer multiple partitions. For such a setting, a dynamic 
notion of primary is needed. A dynamic notion of primary still needs to maintain some kind of 
intersection property, in order to permit enough information flow between successive primary views 
to achieve coherence. For example, each primary view might have to contain at least a majority of 
the processes in the previous primary view. Several dynamic voting schemes have been developed 
to define primaries adaptively, e.g. [28, 36, 55, 88, 77]. 
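
The example rule mentioned above (each primary view contains a majority of the previous primary view) can be contrasted with the static majority rule in a few lines of Python. This is a simplified sketch: it assumes the previous primary is known exactly, whereas real dynamic voting algorithms must cope with uncertainty about it. The names `is_static_primary` and `may_succeed` are hypothetical.

```python
def is_static_primary(view, universe):
    """Static rule: the view holds a strict majority of a fixed universe."""
    return 2 * len(view & universe) > len(universe)


def may_succeed(previous_primary, candidate):
    """Dynamic rule: the candidate holds a strict majority of the previous primary."""
    return 2 * len(candidate & previous_primary) > len(previous_primary)


universe = frozenset({"p1", "p2", "p3", "p4", "p5"})
v1 = frozenset({"p1", "p2", "p3"})
v2 = frozenset({"p2", "p3"})
v3 = frozenset({"p3", "p6"})

print(may_succeed(universe, v1))        # True: 3 of 5
print(may_succeed(v1, v2))              # True: 2 of 3
print(is_static_primary(v2, universe))  # False: 2 of 5 is not a static majority
print(may_succeed(v2, v3))              # False: only 1 of 2
```

The chain of views from the universe to `v1` to `v2` stays primary under the dynamic rule even though `v2` would not qualify under the static one.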

In particular, Yeger Lotem, Keidar, and Dolev [88] have described an implementation of a group
membership service that yields only primary views, according to a dynamic notion of primary. An
interesting feature of their work is that it points out various subtleties of implementing such a
membership service in a distributed manner, namely subtleties involving different opinions by different
processes about what is the previous primary view. These difficulties have led to errors in some of
the past work on dynamic voting. The algorithm of [88] copes with these subtleties by maintaining
information about a collection of primary views that “might be” the previous primary view. The
service deals with group membership only, and not with communication.


1.1.2 Work in this thesis 


In this thesis we have provided group communication specifications which handle primary views 
and primary configurations; the latter required extending the notion of primary view to that of 
primary configuration. We have proved that the specifications are implementable, by exhibiting
algorithms that implement them, and useful as building blocks for more powerful services, by providing
algorithms that implement these more powerful services on top of the group communication
services.


Dynamic views 


We have provided a group communication service, called DVS, that integrates the VS group communication
service with a dynamic primary view membership service, yielding a dynamic primary
view group communication service. The DVS service is inspired by the implementation of [88], but 
integrates communication with the group membership service. 

An important feature of the DVS specification is the careful handling of the interface between
the service and the application. When a new view starts, applications generally require some pre- 
processing, typically, an exchange of information, to prepare for ordinary computation. For example, 
processes in a coherent database application may need to exchange information about previous 
updates in order to bring everyone in the new view up to date. We expect each application process 
to “register” a new view v when it has completed this pre-processing for view v. The DVS service uses
registration information when it creates a new view v, in order to determine which previously-created 
views must satisfy the intersection property with respect to v. When all members have registered v, 
the application has gathered all information it needs from previous views, and the service no longer 
needs to ensure intersection in membership between views before v and any subsequent ones that 
are formed. 

Another feature of the DVS specification, compared to that of Yeger Lotem et al. [88], is that
our specification is given as an automaton, which maintains state information about the views
and the messages sent in each view. This global state can be used in invariants and abstraction
functions, leading to assertional proofs of the correctness of implementations of DVS, and also of
applications built over DVS. In contrast, Yeger Lotem et al. use a specification given in terms of
the whole sequence of events in an execution, and therefore must use operational reasoning about
complex sequences of events. Extensive experience with proofs of distributed algorithms suggests
that assertional techniques are less error-prone; also they are more amenable to automated checking. 

We have demonstrated the value of the DVS specification by showing both how it can be implemented
and how it can be used in an application. Both pieces are shown formally, with assertional
proofs. 

The implementation is a variant of the group membership algorithm of [88]. We have proved 
that this algorithm implements DVS, in the sense of trace inclusion, that is, the external behavior of
the implementation is allowed by the DVS specification. The proof uses a (single-valued) simulation
relation and invariant assertions. The key to the proof is an invariant expressing a strong condition 
about nonempty intersections of views; the proof of this depends on relating a local check of majority 
intersection with known views to a global check of nonempty intersection with existing views. 

We have also provided an application algorithm that is a variant of an algorithm in [57, 7, 41], 
modified to use DVS instead of a static view-oriented service. The modified algorithm uses the 
registration capability to tell the DVS service that information has been successfully exchanged at
the beginning of a new view. We show that it implements a (non-group-oriented) totally-ordered- 
broadcast service. This proof also uses a simulation relation and invariant assertions. 

We have designed our DVS specification to express the guarantees that we think are useful in
verifying correctness of applications that use the service. Among previous work, two different sorts 
of specifications for a primary group service are notable. Work by Ricciardi and others [83] is 
expressed in terms of temporal logic on consistent cuts; the idea of their specification is that on any 
cut, there are no disjoint sets of processes such that each set is collectively aware of no members 
outside that set. Yeger Lotem et al. [88] use a property of an execution, which was previously defined 
by Cristian [26] for majority groups: any two primary views are linked by a chain of views where 
every consecutive pair of views includes a process that “knows” it belongs to both views. As far 
as we know, these previous specifications have not been used to verify properties of applications 
running above them. 

The DVS specification omits some properties of existing dynamic primary view management 
algorithms. For example, Isis [19] guarantees that processes that move together from one view to 
the next receive exactly the same messages in the first view. Guaranteeing this property requires 
state exchange within the view management service. This property is not needed to verify properties 
of other applications, such as the totally-ordered broadcast service of [41]. Also, our service provides 
no explicit support for application-level state exchange. Real systems, e.g. Isis, do provide such 
support, by allowing application-level state exchange messages to be piggy-backed on the lower-level
state exchange messages.


Dynamic configurations


Quorum-based methods for managing replicated data are popular because they provide availability 
of both reads and writes in the presence of faulty behavior by some sites or communication links. A 
quorum system is also called a configuration. If a system lasts for a very long time, it may become 
necessary to alter the configuration, perhaps because some sites have failed permanently and others 
have joined the system, or perhaps because users want a different trade-off between read-availability 
and write-availability. For example, if more sites join the system, these sites must be included in 
the quorums in order to use them; if many sites fail permanently, these sites must be taken out
of the quorums in order to maintain availability. The most common proposal has been to use a 
two-phase commit protocol which stops all application operations while all sites are notified of the 
new configuration. Since two-phase commit is a blocking protocol, this solution is vulnerable to a 
single failure during the configuration change. An alternative proposal in [66] has reconfiguration 
directed by a single site, so it is not fault-tolerant either. In a setting of database transactions,
[47] showed how to integrate fault-tolerant updates of replicated information about quorum sizes 
(using the same quorums for both data item replicas and quorum information replicas). 

Herlihy [51] provides algorithms to shrink and enlarge quorums within a static universe of proces- 
sors; the setting considered in [51] does not allow processors to join and leave the system. Lamport 
discusses how to modify his PAXOS algorithm [61] in order to allow processors to join and leave the
system. In this thesis we integrate these aspects in a group communication framework. 

There are subtle issues that arise in managing the change of configurations, including how to 
make sure that any operation using the new configuration is aware of all information from operations 
that used an old configuration, and how to allow concurrent attempts to alter the configuration. 

In this thesis we have addressed this problem by extending dynamic primary views group commu- 
nication services to handle configurations. The main difficulty in combining configurations with the 
notion of dynamic primary view is the intersection property required to maintain consistency among 
data stored at different sites. A dynamic primary view must intersect the previous one in at least 
a quorum of processors (this property is required, for example, by replicated data applications in 
order to keep all the replicas consistent). With configurations, this intersection property, which works
for primary views, is no longer enough. Indeed, updated information might be held only at a quorum, and
the processors in the intersection might not be in that quorum. A stronger intersection property is
required. We have proposed one possible intersection property that allows applications to keep data 
consistency across changes of primary configurations. Namely, we require that there be a quorum of 
the old primary configuration which is included in the membership set of the new primary configu- 
ration. This guarantees that there is at least one process in the new primary configuration that has 
the most up to date information. This, similarly to the intersection property of dynamic primary
views, allows flow of information from the old configuration to the new one and thus permits one to
preserve data consistency.

We actually considered a more specialized version of configurations which uses two sets of quo- 
rums, a set of read quorums and a set of write quorums, with the property that any read quorum 
intersects any write quorum. (This choice is justified by the application we develop, an atomic 
read/write register.) With this kind of configuration the intersection property that we require for a 
new primary configuration is that there be one read quorum and one write quorum both of which 
are included in the membership set of the new primary configuration. The use of read and write 
quorums (as opposed to just quorums) can be more efficient in order to balance the load of the 
system (e.g., [37]). 
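
The intersection property for read/write configurations can be sketched in Python as follows. This is an illustrative model only: the `Configuration` class, the helper `may_follow` and the four-member example are assumptions introduced here and do not reproduce the formal specification given later in the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Configuration:
    """A view together with read and write quorums over its members."""
    members: frozenset
    read_quorums: tuple
    write_quorums: tuple

    def is_well_formed(self):
        # Every quorum is a subset of the membership set.
        within = all(q <= self.members
                     for q in self.read_quorums + self.write_quorums)
        # Every read quorum intersects every write quorum.
        intersect = all(r & w
                        for r in self.read_quorums
                        for w in self.write_quorums)
        return within and intersect


def may_follow(old, new_members):
    """True if `new_members` includes a read quorum and a write quorum of `old`."""
    has_read = any(r <= new_members for r in old.read_quorums)
    has_write = any(w <= new_members for w in old.write_quorums)
    return has_read and has_write


old = Configuration(
    members=frozenset("abcd"),
    read_quorums=(frozenset("ab"), frozenset("cd")),
    write_quorums=(frozenset("ac"), frozenset("bd")),
)

print(old.is_well_formed())                 # True
print(may_follow(old, frozenset("abce")))   # True: contains {a, b} and {a, c}
print(may_follow(old, frozenset("aef")))    # False: intersects old, but holds no quorum
```

The last call shows why a nonempty intersection with the old membership is not sufficient: process `a` alone need not hold the latest written value.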

The resulting dynamic primary configuration group communication service is called DC. This
service also integrates support for state exchange into the DC specification. This improves the 
modularity of the building block. When a new configuration starts, applications generally require 
some pre-processing, such as an exchange of information, to prepare for ordinary computation. 
Typically this is needed in order to bring every member of the configuration up to date. For 
example, processes in a coherent database application may need to exchange information about 
previous updates in order to bring everyone in the new configuration up to date. We will refer to 
the up-to-date state of a new configuration as the starting state of that configuration. The starting 
state is the state of the computation that all members should have in order to perform regular 
computation. The computation of the starting state should be offered by the communication service 
so that applications do not have to worry about the details of the underlying state exchange. We 
have demonstrated the value of the DC specification by showing both an algorithm that implements
DC and how DC can be used in an application. The implementation is based on a variant of the
group membership algorithm of [88]. The application is an atomic read/write shared register, and
is similar to the work of Attiya, Bar-Noy and Dolev [12] and of Lynch and Shvartsman [66].


Dynamic algorithms 


We have investigated the use of the technique introduced to design the DVS and DC services to
transform services and applications that are designed for “static” settings, into ones that work well 
in “dynamic” settings. 

We used a variant of the DC service to provide a dynamic version of Lamport’s PAXOS
algorithm [61]. The PAXOS algorithm solves a fundamental problem of distributed computing: the
consensus problem. In such a problem processors of a distributed system start computation with 
an input value and have to make an irreversible decision guaranteeing agreement, which requires 
that all decisions are the same, and validity, which requires that any decision is equal to some input 
value. 
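
The two safety conditions can be written as checks over the outcome of a run. The Python sketch below assumes that inputs and decisions are recorded as dictionaries keyed by process name; this representation and the function names are hypothetical and serve only to illustrate the definitions.

```python
def satisfies_agreement(decisions):
    """Agreement: all decisions are the same."""
    return len(set(decisions.values())) <= 1


def satisfies_validity(inputs, decisions):
    """Validity: every decision is equal to some input value."""
    return set(decisions.values()) <= set(inputs.values())


inputs = {"p1": 3, "p2": 7, "p3": 9}

good = {"p1": 7, "p2": 7}          # p3 has not decided (for example, it stopped)
split = {"p1": 3, "p2": 7}         # two different decisions
invented = {"p1": 5, "p2": 5}      # 5 is nobody's input

print(satisfies_agreement(good), satisfies_validity(inputs, good))          # True True
print(satisfies_agreement(split), satisfies_validity(inputs, split))        # False True
print(satisfies_agreement(invented), satisfies_validity(inputs, invented))  # True False
```

Termination is a liveness property and cannot be checked on a finite record of this kind, so it is left out of the sketch.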


The PAXOS algorithm tolerates many types of failures: timing failures; loss, duplication and
reordering of messages; and process stopping failures. Process recoveries are considered; some stable
storage is needed. PAXOS is guaranteed to work safely, that is, to satisfy agreement and validity, 
regardless of process, channel and timing failures and process recoveries. When the distributed 
system stabilizes, meaning that there are no failures nor process recoveries and a majority of the 
processes are not stopped, for a sufficiently long time, termination is achieved; the performance of 
the algorithm when the system stabilizes is good. 

The original algorithm is designed for static settings, where failures are transient, that is, failed 
processors recover. If a majority (or a quorum) of the processors is not available the system is 
blocked. If such a majority or quorum permanently leaves the system, then the system is blocked 
forever. The variant we have designed adapts well to permanent changes of the underlying distributed 
system. 

The PAXOS algorithm bears many similarities with an earlier algorithm of Liskov and Oki [76]. 
The work of Liskov and Oki uses a notion of “view” which changes when a new primary site needs to 
be selected. The notions of “view” and “view synchrony” have since proven very successful
(see the overview and related work of this section). 

We have also provided a dynamic primary copy data replication algorithm. Like the dynamic
version of PAXOS, this algorithm is based on a variant of the DC service. This algorithm uses
a centralized approach in which a “leader” process is responsible for providing responses to clients’
queries. In order to maintain consistency, this leader process replicates (part of) its own state to a
quorum of processes. The algorithm exploits the quorum-oriented framework provided by the DC
service. We sketch the proof of correctness of this algorithm; the technique used to prove correct
other applications developed in this thesis should apply also to this algorithm.


1.2 Distributed consensus 


1.2.1 Overview and related work 


Another important building block for distributed systems is distributed consensus. The consensus
problem arises in many forms and various contexts, such as distributed data replication,
distributed databases, and flight control systems. Data replication is used in practice to provide high
availability: having more than one copy of the data allows easier access to the data, i.e., the near- 
est copy of the data can be used. However, consistency among the copies must be maintained. A 
consensus algorithm can be used to maintain consistency. A practical example of the use of data 
replication is an airline reservation system. The data consists of the current booking information 
for the flights and it can be replicated at agencies spread over the world. The current booking 
information can be accessed at any of the replicas. Reservations or cancellations must be agreed
upon by all the copies.

In a distributed database, the consensus problem arises when a collection of processes participating
in the processing of a distributed transaction has to agree on whether to commit or abort the
transaction, that is, make the changes due to the transaction permanent or discard the changes. 
A common decision must be taken to avoid inconsistencies. A practical example of the use of dis- 
tributed transactions is a banking system. Transactions can be done at any bank location or ATM 
machine, and the commitment or abortion of each transaction must be agreed upon by all the bank 
locations or ATM machines involved.

In a flight control system, the consensus problem arises when the flight surface and airplane control
systems have to agree on whether to continue or abort a landing in progress or when the control 
systems of two approaching airplanes need to modify the air routes to avoid collision. 

Distributed consensus has been extensively studied; a good survey of early results is provided in
[42]. We refer the reader to [65] for a more up-to-date treatment of consensus problems.


One of the most celebrated results about distributed consensus is the impossibility result of
Fischer, Lynch and Paterson [43]. This impossibility result, popularly known as FLP, states that it
is impossible to achieve distributed consensus in asynchronous systems even if only one stop failure
is possible. This surprising result sparked various directions of research aimed to solve the problem 
by either restricting the asynchrony of the computation model (e.g. [31, 35]) or using randomized 
protocols (e.g. [18, 21, 80]) or weakening the problem definition (e.g. [24, 32, 39, 40]). 

The last of these three directions of research falls in the more general research area of demarcat- 
ing what is deterministically computable and what is deterministically impossible in asynchronous 
distributed systems in the presence of failures. The FLP impossibility seemed to suggest that no 
nontrivial problem could be solved deterministically and asynchronously in the presence of faults. 
Attiya, Bar-Noy, Dolev, Peleg and Reischuk [13] showed that the renaming problem can be solved
in a deterministic way in asynchronous systems in the presence of failures. Informally, in the renaming
problem processors start the computation with a “name” taken from some unbounded ordered
name space and have to “rename” themselves with names chosen from a new small name space.
This result revived the research trend of exploring what is computable and what is impossible in
deterministic asynchronous distributed systems subject to failures. Following this direction, Chaudhuri [24] defined
the k-consensus problem, which is a natural generalization of the consensus problem obtained by 
allowing processes to decide on k different values, instead of requiring them to agree on a single 
value. The 1-consensus problem is the classical consensus problem. 

Chaudhuri provided an algorithm to solve the k-consensus problem that tolerates up to a threshold
t of process failures strictly smaller than k. This result proved that the k-consensus problem,
for k ≥ 2, allows more resilience than the 1-consensus problem. Chaudhuri conjectured that the
k-consensus problem was impossible to solve while tolerating k or more failures. This conjecture was
proven true by three independent research teams: Borowsky and Gafni [20], Herlihy and Shavit [52],
and Saks and Zaharoglou [84]. Attiya [11] provided an alternative proof of the same result.

The results of [24, 20, 52, 84] completely characterize the k-consensus problem in asynchronous 
systems with stop failures. In such a model the k-consensus problem is solvable if and only if t < k. 
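
To make the bound t < k concrete, the following is a minimal sketch of one simple rule that stays within it in a message-passing setting: each process broadcasts its input, waits for n − t values (it cannot safely wait for more, since up to t processes may have crashed), and decides the minimum value it received. Every decided value is then among the t + 1 smallest inputs, so at most t + 1 ≤ k distinct values are decided. The function name, the random choice of which messages arrive first, and the sample inputs are assumptions made for illustration; the sketch is not claimed to be the exact algorithm of [24].

```python
import random

def decide_min_of_heard(inputs, t, rng):
    """Simulate one run: every process decides the minimum of n - t received inputs."""
    n = len(inputs)
    decisions = []
    for p in range(n):
        others = [q for q in range(n) if q != p]
        # Process p hears its own value plus n - t - 1 values from an arbitrary
        # subset of the other processes (chosen at random here to mimic asynchrony).
        heard = [p] + rng.sample(others, n - t - 1)
        decisions.append(min(inputs[q] for q in heard))
    return decisions

inputs = [7, 3, 9, 1, 5, 8]  # hypothetical inputs of n = 6 processes
t = 2                        # failure threshold, so the rule targets k = t + 1 = 3
decisions = decide_min_of_heard(inputs, t, random.Random(0))
print(sorted(set(decisions)))  # never more than t + 1 distinct values
```

Each decision is an input value, so the validity condition of [24, 20, 52, 84] is respected by construction.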

The formal definition of the k-consensus problem requires three conditions to be satisfied: agree- 
ment, termination and validity. The agreement condition requires that each process decide on a 
value in such a way that the set of decided values has cardinality at most k. The termination condition
simply requires that each (correct) process decide. As for the validity condition,
several variants have been considered in the literature. The validity condition used in [24, 20, 52, 84]
requires that each decision be equal to some input value.

An alternative definition of the validity condition considered for the 1-consensus problem with 
stop failures requires that if all the inputs to the processes of the system are equal then any decision
must be equal to the input (see, for example, Chapter 6 of [65]). 

In a Byzantine environment faulty processes can “mask” their inputs. Hence a more suitable 
validity condition considered for the 1-consensus problem with Byzantine failures requires that if all
the correct processes have the same input then any decision be the input of a correct process [62, 78].
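
The agreement condition and the three validity conditions recalled above can be written as predicates over the inputs and decisions of one finished run. This is a hedged sketch: the dictionary representation (process identifier to value), the function names, and the reading of "any decision" in the Byzantine case as "any decision of a correct process" are our assumptions.

```python
def agreement(decisions, k):
    """At most k distinct values are decided."""
    return len(set(decisions.values())) <= k

def validity_some_input(inputs, decisions):
    """Each decision equals some input value."""
    return all(d in inputs.values() for d in decisions.values())

def validity_unanimous(inputs, decisions):
    """If all inputs are equal, every decision equals that input."""
    values = set(inputs.values())
    if len(values) != 1:
        return True  # the condition holds vacuously
    return set(decisions.values()) <= values

def validity_correct_unanimous(inputs, decisions, correct):
    """If all correct processes share an input, correct processes decide it."""
    values = {inputs[p] for p in correct}
    if len(values) != 1:
        return True  # the condition holds vacuously
    return all(decisions[p] in values for p in correct if p in decisions)

inputs = {1: "a", 2: "a", 3: "b"}   # hypothetical run with three processes
decisions = {1: "a", 2: "b"}        # process 3 never decides
print(agreement(decisions, 2), agreement(decisions, 1))
print(validity_some_input(inputs, decisions))
print(validity_correct_unanimous(inputs, decisions, correct={1, 2}))
```

In the sample run the first two validity conditions hold, while the last one fails if process 3 is the only faulty process: the correct processes share input "a" yet process 2 decides "b".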


1.2.2 Work in this thesis 


In this thesis we have explored several alternative validity conditions and we consider the k-consensus 
problem in asynchronous systems both with stop failures and with Byzantine failures. We have 
considered six different definitions for the validity condition of the k-consensus problem. In many 
cases the validity condition makes a difference. We have considered the six variations of the k-consensus
problem both in the stop failure case and in the Byzantine failure case. This led to twelve
different problems. One of these is the k-consensus problem considered in [24, 20, 52, 84]. Hence for
this problem we already know the line that separates solvable from impossible (the problem is solvable
if and only if the number of allowed failures is strictly less than k). For the other variations of the
problem, and in particular for the Byzantine settings, the line between impossible and possible was
not known. We have demarcated these lines.


1.3 Summary of contributions


This thesis provided new formal specifications for group communication services. The specifications 
are shown to be implementable and useful to build applications. The significance of this work is 
two-fold: on one hand it is a contribution in the identification of useful formal specifications for 
group communication services, a research area very active recently; on the other hand we have 
explored the possibility of integrating into a single group communication building block the notion
of primary view and that of configuration, both of which are well known but have never been used
together. The specifications we have provided are tailored to dynamic systems, where processors
join and leave the system routinely and possibly permanently. The approach used to design such
dynamic services has also been applied to transform known algorithms, designed for static settings,
in order to make them adaptable to dynamic systems.

This thesis also investigated some theoretical aspects of another important building block for distributed
systems: distributed consensus. We extended previous work by exploring several variations
of the problem definition and model, including for the first time investigation of Byzantine failures. 
We showed that the precise definition of the validity requirement, which characterizes what decision 
values are allowed as a function of the input values and whether failures occur, is crucial to the 
solvability of the problem. We introduced six validity conditions for this problem (all considered in 
various contexts in the literature), and we demarcated the line between possible and impossible for 
each case. In many cases this line is different from the one of the originally studied k-set consensus
problem.


1.4 Thesis roadmap 


The rest of the thesis is divided into two parts. The first part is dedicated to group communication
services while the second part studies the consensus problem. 

Part I (group communication services) is structured as follows. Chapter 2 contains an overview 
of group communication services. Chapter 3 contains notation and terminology used throughout the 
rest of the part and introduces the I/O automaton model, which is used to provide the specifications, 
the implementations and the applications. Chapter 4 describes the VS service of [41]; such a service is
used as a building block for the implementations of the group communication services provided in this
thesis. Chapter 5 contains the DVS specification, a specification for a dynamic primary view group
communication service, together with an implementation and a totally ordered broadcast service
running on top of DVS. Chapter 6 contains the DC specification, a specification for a dynamic
primary configuration group communication service, together with an atomic read/write register
implemented on top of DC. Chapter 7 provides a version of Lamport’s PAXOS algorithm implemented
on top of a variation of the DC service. Finally, Chapter 8 provides concluding remarks for Part I.

Part II (distributed consensus) is structured as follows. Chapter 9 contains an introduction to 
the problem. Chapter 10 describes the model of computation and provides a formal definition of the 
problem. Chapters 11 and 12 study the k-set consensus problem in the crash failures and Byzantine
failures models, respectively. Finally, Chapter 13 provides concluding remarks for Part II.


Part I


Group Communication Services


Chapter 2


Group Communication Services: Overview


Developing distributed applications is a difficult task, because of the complexities of the applications 
themselves and of the fault-prone distributed settings in which they run. Considerable effort is de- 
voted to making distributed applications robust in the face of typical processor and communication 
failures. A successful approach to overcome these difficulties is to modularize the system by im- 
plementing suitable building blocks that provide powerful general-purpose distributed computation 
services. 

Among the most important examples of building blocks are group communication services. Group 
communication services enable processes located at different nodes of a distributed network to op- 
erate collectively as a group. The processes do this by using a group communication service to 
multicast messages to all members of the group. Different group communication services offer differ- 
ent guarantees about the order and reliability of message delivery. Examples are found in Isis [19], 
Transis [33], Totem [73], Newtop [38], Relacs [14] and Horus [86]. 

The basis of a group communication service is a group membership service. Each process, at each 
time, has a unique view of the membership of the group. The view includes a list of the processes 
that are members of the group. Views can change from time to time, and may become different at 
different processes. Isis introduced the important concept of virtual synchrony [19]. This concept 
has been interpreted in various ways, but an essential requirement is that if a particular message 
is delivered to several processes, then all have the same view of the membership when the message
is delivered, which is also the view in which the message was sent. This allows the recipients to take
coordinated action based on the message, the membership set and, obviously, the application. 

To be most useful to application programmers, system building blocks should come equipped
with simple and precise specifications of their guaranteed behavior. Such specifications would allow
application programmers to think carefully about the behavior of systems that use the primitives,
without having to understand how the primitives themselves are implemented. Unfortunately, pro- 
viding appropriate specifications for group communication services is not an easy task. Some of 
these services are rather complicated, and there is still no agreement about exactly what the guar- 
antees should be. Different specifications arise from different implementations of the same service, 
because of differences in the safety, performance, or fault-tolerance that is provided. Moreover, the 
specifications that most accurately describe particular implementations may not be the ones that 
are easiest for application programmers to use. Examples of specifications for group membership and
communication services can be found in [17, 23, 25, 27, 34, 72, 81, 83].


In distributed applications involving replicated data, a well-known way to enhance the availability
and efficiency of the system is to use quorums. A quorum system is a set of subsets of the members of 
the system which satisfy the property that any two sets intersect. We refer to a view with a quorum 
system as a configuration. Using configurations an update can be performed with only a quorum 
available, while with an ordinary view all of the members must be available. The intersection 
property of quorums guarantees consistency within a given configuration. Quorum systems have 
been extensively studied and used in applications (e.g., [2, 37, 47, 50, 70, 74]). 
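
As an illustration of the intersection property, the sketch below checks whether a family of subsets is a quorum system over a membership set and builds the familiar majority quorum system. The function names and the five-member example are assumptions chosen for illustration.

```python
from itertools import combinations

def is_quorum_system(members, quorums):
    """True if quorums is a nonempty family of nonempty subsets of members
    in which any two sets intersect."""
    quorums = [frozenset(q) for q in quorums]
    return (len(quorums) > 0
            and all(q and q <= members for q in quorums)
            and all(q1 & q2 for q1, q2 in combinations(quorums, 2)))

def majorities(members):
    """All minimal majorities of members: subsets of size floor(n / 2) + 1."""
    size = len(members) // 2 + 1
    return [frozenset(c) for c in combinations(sorted(members), size)]

members = frozenset({1, 2, 3, 4, 5})
print(is_quorum_system(members, majorities(members)))   # majorities always intersect
print(is_quorum_system(members, [{1, 2}, {3, 4}]))      # disjoint sets do not qualify
```

With the majority system an update needs only three of the five members to be available, whereas an ordinary view would require all five.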

Pre-defined quorum sets can yield efficient implementations in settings where the system is 
relatively static, that is, failures are transient. However, they work less well in settings where the 
set of processors in the network evolves over time, with processes joining and leaving the system. 
For such a setting, a dynamic notion of primary is needed. A dynamic notion of primary still 
needs to maintain some kind of intersection property, in order to permit enough information flow 
between successive primary views to achieve coherence. For example, each primary view might have 
to contain at least a majority of the processes in the previous primary view. Several dynamic voting 
schemes have been developed to define primaries adaptively, e.g. [28, 36, 55, 88, 71, 77]. 
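
The majority rule mentioned as an example above can be sketched as follows. This is a centralized simplification under our own assumptions (function names, sample views); in particular it assumes that everybody agrees on what the previous primary view is, which is exactly what a distributed implementation cannot take for granted.

```python
def may_become_primary(candidate, previous_primary):
    """True if candidate contains a strict majority of the previous primary view."""
    return 2 * len(candidate & previous_primary) > len(previous_primary)

def accepted_chain(initial, proposals):
    """Successive primary views obtained by filtering proposals with the rule above."""
    chain = [initial]
    for candidate in proposals:
        if may_become_primary(candidate, chain[-1]):
            chain.append(candidate)
    return chain

initial = {1, 2, 3, 4, 5}
proposals = [{1, 2, 3}, {4, 5}, {1, 2}, {1, 2, 6, 7}]
for view in accepted_chain(initial, proposals):
    print(sorted(view))
```

The view {1, 2} is accepted because it holds a majority of the previous primary {1, 2, 3}, although it is not a majority of the initial membership; a static majority rule over the initial five processes would block at that point.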

In particular, Yeger Lotem, Keidar, and Dolev [88] have described an implementation of a group 
membership service that yields only primary views, according to a dynamic notion of primary. An 
interesting feature of their work is that it points out various subtleties of implementing such a 
membership service in a distributed manner — subtleties involving different opinions by different 
processes about what is the previous primary view. These difficulties have led to errors in some of 
the past work on dynamic voting. The algorithm of [88] copes with these subtleties by maintaining 
information about a collection of primary views that “might be” the previous primary view. The 
service deals with group membership only, and not with communication. Yeger Lotem et al. prove 
that their protocol satisfies the following condition on system executions: any two (primary) views 
that occur in an execution are linked by a chain of views where for every consecutive pair of views 
in the chain, there is some process that “knows” it belongs to both views. 


In Chapter 5 we provide a group communication service, called DVS, that integrates the VS
group communication service with a dynamic primary view membership service, yielding a dynamic
primary view group communication service. The DVS service is inspired by the implementation 
of [88], but integrates communication with the group membership service. We also show how the 
DVS specification can be implemented and used for an application. 

In Chapter 6 we extend the notion of “primary view” to that of “primary configuration”. The 
main difficulty in making this step is to identify the intersection property between two successive
primary configurations that allows consistency to be maintained. We propose one such property.
Namely, we require that there be a quorum of the old primary configuration which is included in the
membership set of the new primary configuration. This guarantees that there is at least one process
in the new primary configuration that has the most up-to-date information. This, similarly to the
intersection property of dynamic primary views, allows flow of information from the old configuration
to the new one and thus permits consistency to be preserved.

We actually consider a more specialized version of configurations which uses two sets of quorums, 
a set of read quorums and a set of write quorums, with the property that any read quorum intersects 
any write quorum. (This choice is justified by the application we develop, an atomic read/write 
register.) With this kind of configuration the intersection property that we require for a new primary 
configuration is that there be one read quorum and one write quorum both of which are included 
in the membership set of the new primary configuration. The use of read and write quorums (as 
opposed to just quorums) can be more efficient in order to balance the load of the system (e.g., [37]). 
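
The two intersection properties between successive primary configurations can be sketched as simple set-inclusion tests. The function names and the grid-shaped example (rows as read quorums, columns as write quorums) are assumptions made for illustration, and we read the quorums in the second property as quorums of the old primary configuration.

```python
def succeeds_with_quorum(old_quorums, new_members):
    """Some quorum of the old configuration is included in the new membership set."""
    return any(q <= new_members for q in old_quorums)

def succeeds_with_read_write(old_read, old_write, new_members):
    """Some old read quorum and some old write quorum are both included
    in the new membership set."""
    return (any(r <= new_members for r in old_read)
            and any(w <= new_members for w in old_write))

# Old configuration over {1, 2, 3, 4} arranged as a 2 x 2 grid:
# every row (read quorum) intersects every column (write quorum).
old_read = [{1, 2}, {3, 4}]
old_write = [{1, 3}, {2, 4}]

print(succeeds_with_read_write(old_read, old_write, {1, 2, 3, 5}))  # row {1,2}, column {1,3}
print(succeeds_with_read_write(old_read, old_write, {1, 2, 5}))     # no full column
```

In the first call process 1 belongs to a write quorum of the old configuration, so it is guaranteed to have seen the latest write, which is the information flow that the property is meant to secure.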

We provide a group communication service, called DC, that integrates a group communication
service with a dynamic primary configuration membership service. We prove that the DC service is 
implementable and can be used for applications. 

Finally, in Chapter 7, we investigate the use of the technique introduced to design DVS and DC
to transform services and applications that are designed for “static” settings into ones that work
well in “dynamic” settings. Specifically, we design a service similar to DC and we use that service to
provide a dynamic version of Lamport’s PAXOS algorithm [61]. The original algorithm is designed
for systems that are relatively static: if a majority (or, more generally, a quorum) of the processors is
not available then the algorithm blocks. The dynamic version adapts well to permanent changes of
the system. We also design a primary copy data replication algorithm; this algorithm is similar to
the Liskov-Oki algorithm [76] but considers dynamic settings, while the Liskov-Oki algorithm is designed for
static settings.


Chapter 3


Mathematical foundations 


In this chapter we introduce some terminology and notation, and then we provide the underlying formal
model used to specify our group communication services and applications. Section 3.1 provides
terminology and notation and Section 3.2 describes the IOA model.


3.1 Notation and terminology 


3.1.1 Sets, functions, sequences 


We write $\lambda$ for the empty sequence. If $a$ is a sequence then $|a|$ denotes the length of $a$. If $a$ is
a sequence and $1 \le i \le j \le |a|$ then $a(i)$ denotes the $i$th element of $a$ and $a(i..j)$ denotes the
subsequence $a(i), a(i+1), \ldots, a(j)$ of $a$. The head of a nonempty sequence $a$ is $a(1)$. A sequence can
be used as a queue: the append operation modifies the sequence by concatenating it with a new
element and the remove operation modifies the sequence by deleting its head.

If $a$ and $b$ are sequences, $a$ finite, then $a \circ b$ denotes the concatenation of $a$ and $b$. We sometimes
abuse this notation by letting $a$ or $b$ be a single element. We say that sequence $a$ is a prefix of
sequence $b$, written $a \le b$, provided that there exists $c$ such that $a \circ c = b$. A collection $A$ of
sequences is consistent provided that $a \le b$ or $b \le a$ for all $a, b \in A$. If $A$ is a consistent collection
of sequences, we define $\mathit{lub}(A)$ to be the minimum sequence $b$ such that $a \le b$ for all $a \in A$.

If $S$ is a set, then $\mathit{seqof}(S)$ denotes the set of all finite sequences of elements of $S$. If $a \in \mathit{seqof}(S)$
and $f$ is a partial function from $S$ to $T$ whose domain includes the set of all elements of $S$ appearing in
$a$, then $\mathit{applytoall}(f, a)$ denotes the sequence $b$ such that $\mathit{length}(b) = \mathit{length}(a)$ and, for $i \le \mathit{length}(b)$,
$b(i) = f(a(i))$.
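
The sequence operations just introduced can be sketched with Python tuples. The sketch is restricted to finite collections of finite sequences, represents a partial function as a dictionary, and uses function names that mirror the notation; all of these are our assumptions.

```python
def is_prefix(a, b):
    """a <= b: there is some c with a + c == b."""
    return a == b[:len(a)]

def is_consistent(seqs):
    """Any two sequences of the collection are comparable in the prefix order."""
    return all(is_prefix(a, b) or is_prefix(b, a) for a in seqs for b in seqs)

def lub(seqs):
    """Least upper bound of a finite consistent collection: its longest member
    (the empty sequence for an empty collection)."""
    if not is_consistent(seqs):
        raise ValueError("collection is not consistent")
    return max(seqs, key=len, default=())

def applytoall(f, a):
    """Apply the partial function f (a dict) to every element of sequence a."""
    return tuple(f[x] for x in a)

print(lub([(1,), (1, 2), (1, 2, 3)]))
print(is_consistent([(1, 2), (1, 3)]))
print(applytoall({1: "a", 2: "b"}, (1, 2, 1)))
```

For an infinite consistent collection the least upper bound may be an infinite sequence, which this finite sketch does not cover.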

If $S$ is a set, the notation $S_\bot$ refers to the set $S \cup \{\bot\}$. Whenever $S$ is ordered, we order $S_\bot$
by extending the order on $S$, and making $\bot$ less than all elements of $S$. If $R$ is a binary relation,
then we define $\mathit{dom}(R)$, the domain of $R$, to be the set (without repetitions) of first elements of the
ordered pairs comprising relation $R$. If $f$ is a partial function from $S$ to $T$, and $(s, t) \in S \times T$, then
$f \oplus (s, t)$ is defined to be the partial function that is identical to $f$ except that $f(s) = t$.
We denote by $\mathit{arrayof}(S)$ the set of all arrays, indexed by positive integers, whose entries consist
of elements of $S_\bot$.


3.1.2 Processors, views, configurations, identifiers 


$\mathcal{P}$ denotes the universe of all processors,¹ and $\mathcal{M}$ the universe of all possible messages. $\mathcal{G}$ is a
totally ordered set of identifiers used to distinguish views or configurations, with a distinguished
least element $g_0$.

A view $v = (g, P)$ consists of a view identifier $g \in \mathcal{G}$ and a nonempty membership set $P \subseteq \mathcal{P}$; we
write $v.id$ and $v.set$ to denote the view identifier and membership set components of $v$, respectively.
$\mathcal{V}$ denotes the set of all views, and $v_0 = (g_0, P_0)$ is a distinguished initial view.

The notion of view can be generalized to that of configuration. A configuration is a view with 
a structure defined on the view. For example a configuration can be a view with a set of quorums 
defined over the membership set of the view. However configurations can be tailored to applications.
For example, applications that use read and write quorums, use configurations which are views with 
a set of read quorums and a set of write quorums; applications that use a “leader” processor use 
configurations with a leader processor. Next we define several types of configurations. We will 
specify the type of configuration we use in the chapter where we use it. 

A configuration is a triple, $c = (g, P, \mathcal{Q})$, where $g \in \mathcal{G}$ is a unique identifier, $P \subseteq \mathcal{P}$ is a nonempty
set of processors, and $\mathcal{Q}$ is a nonempty set of nonempty subsets of $P$, such that any two subsets
intersect. Each element of $\mathcal{Q}$ is called a quorum of $c$.

A more specialized type of configuration is a quadruple, $c = (g, P, \mathcal{R}, \mathcal{W})$, where $g \in \mathcal{G}$ is a unique
identifier, $P \subseteq \mathcal{P}$ is a nonempty set of processors, and $\mathcal{R}$ and $\mathcal{W}$ are nonempty sets of nonempty
subsets of $P$, such that $R \cap W \ne \emptyset$ for all $R \in \mathcal{R}$, $W \in \mathcal{W}$. Each element of $\mathcal{R}$ is called a read quorum
of $c$, and each element of $\mathcal{W}$ a write quorum. We refer to this type of configuration as a read-write
quorum configuration.

Another type of configuration is a quadruple, $c = (g, P, \mathcal{Q}, p)$, where $g \in \mathcal{G}$ is a unique identifier,
$P \subseteq \mathcal{P}$ is a nonempty set of processors, $\mathcal{Q}$ is a quorum system and $p \in P$ is a distinguished
processor, called the leader of the configuration. We refer to this type of configuration as a leader
configuration.

Once a particular type of configuration is fixed, we let $\mathcal{C}$ denote the set of all configurations. Given
a configuration $c$, the notation $c.id$ refers to the configuration identifier $g$, the notation $c.set$ refers
to the membership set $P$; the notation $c.qrms$ refers to the quorum system $\mathcal{Q}$ while $c.rqrms$ and
$c.wqrms$ refer to the set of read quorums $\mathcal{R}$ and the set of write quorums $\mathcal{W}$, respectively; the notation
$c.ldr$ refers to the leader $p$ of configuration $c$.

¹ We use “processor” and “process” interchangeably.

We distinguish an initial configuration $c_0 = (g_0, P_0, \mathcal{R}_0, \mathcal{W}_0)$ (or $c_0 = (g_0, P_0, \mathcal{Q}_0)$, or
$c_0 = (g_0, P_0, \mathcal{Q}_0, p_0)$, depending on the type of configuration that we are using), where $g_0$ is a distinguished
configuration identifier.


3.2 The I/O automaton (IOA) model


We describe our services and algorithms using the I/O automaton model of Lynch and Tuttle [67] 
(without fairness). 

The I/O automata (IOA for short) model is a formal model suitable for describing asynchronous 
distributed systems. The basic I/O automaton model was introduced by Lynch and Tuttle [67]. 

Various extensions of the basic model have been developed. For example two extensions provide 
formal mechanisms to handle the passage of time and thus are suitable for describing partially 
synchronous distributed systems; these models are the MMT automaton (MMTA for short) and the 
general timed automaton (GT automaton or GTA for short). The MMTA is a special case of GTA, 
and thus it can be regarded as a notation for describing some GT automata. 

For the purpose of this thesis, we will use the basic I/O automaton model, which we describe
in Section 3.2.1. Section 3.2.2 describes the “composition” operation on automata. The interested
reader can find more information about IOA in [65].


3.2.1 IOA definition 


An I/O automaton is a simple type of state machine in which transitions are associated with named 
actions. These actions are classified into categories, namely input, output, internal and, for the timed 
models, time-passage. Input and output actions are used for communication with the external envi- 
ronment, while internal actions are local to the automaton. The time-passage actions are intended 
to model the passage of time. The input actions are assumed not to be under the control of the 
automaton, that is, they are controlled by the external environment which can force the automaton 
to execute the input actions. Internal and output actions are controlled by the automaton. The 
time-passage actions are also controlled by the automaton (though this may at first seem somewhat 
strange, it is just a formal way of modeling the fact that the automaton must perform some action 
before some amount of time elapses). 

As an example, we can consider an I/O automaton that models the behavior of a process involved 
in a consensus problem. Figure 3-1 shows the typical interface (that is, input and output actions) 
of such an automaton. The automaton is drawn as a circle, input actions are depicted as incoming
arrows and output actions as outgoing arrows (internal actions are hidden since they are local
to the automaton). The automaton receives inputs from the external world by means of action
INIT(v), which represents the receipt of an input value v, and conveys outputs by means of action
DECIDE(v), which represents a decision of v. Actions SEND(m) and RECEIVE(m) are supposed to model
the communication with other automata.

Figure 3-1: An I/O automaton. [Figure: a circle labeled "I/O automaton" with arrows labeled Init(v), Decide(v), Send(m) and Receive(m).]

A signature $S$ is a triple consisting of three disjoint sets of actions: the input actions, $\mathit{in}(S)$,
the output actions, $\mathit{out}(S)$, and the internal actions, $\mathit{int}(S)$. The external actions, $\mathit{ext}(S)$, are
$\mathit{in}(S) \cup \mathit{out}(S)$; the locally controlled actions, $\mathit{local}(S)$, are $\mathit{out}(S) \cup \mathit{int}(S)$; and $\mathit{acts}(S)$ consists of all
the actions of $S$. The external signature, $\mathit{extsig}(S)$, is defined to be the signature $(\mathit{in}(S), \mathit{out}(S), \emptyset)$.
The external signature is also referred to as the external interface.


An I/O automaton (IOA for short) $A$ consists of five components:

• $\mathit{sig}(A)$, a signature

• $\mathit{states}(A)$, a (not necessarily finite) set of states

• $\mathit{start}(A)$, a nonempty subset of $\mathit{states}(A)$ known as the start states or initial states

• $\mathit{trans}(A)$, a state-transition relation, where $\mathit{trans}(A) \subseteq \mathit{states}(A) \times \mathit{acts}(\mathit{sig}(A)) \times \mathit{states}(A)$; this must have the property that for every state $s$ and every input action $\pi$, there is a transition $(s, \pi, s') \in \mathit{trans}(A)$

• $\mathit{tasks}(A)$, a task partition, which is an equivalence relation on $\mathit{local}(\mathit{sig}(A))$ having at most countably many equivalence classes


Often acts(A) is used as shorthand for acts(sig(A)), and similarly in(A), and so on. 

An element $(s, \pi, s')$ of $\mathit{trans}(A)$ is called a transition, or step, of $A$. If for a particular state $s$
and action $\pi$, $A$ has some transition of the form $(s, \pi, s')$, then we say that $\pi$ is enabled in $s$. Input
actions are enabled in every state.
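
For a finite automaton, the definition above can be transcribed almost literally. The sketch below is an illustration under our own assumptions: the class and method names are hypothetical, the task partition is omitted, and the toy "process" automaton (with actions init and decide stripped of their parameters) is invented for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Signature:
    inputs: frozenset
    outputs: frozenset
    internals: frozenset

    def acts(self):
        return self.inputs | self.outputs | self.internals

@dataclass(frozen=True)
class Automaton:
    sig: Signature
    states: frozenset
    start: frozenset
    trans: frozenset  # set of triples (s, action, s')

    def enabled(self, s, action):
        """True if some transition (s, action, s') exists."""
        return any(src == s and act == action for (src, act, _dst) in self.trans)

    def is_input_enabled(self):
        """Every input action is enabled in every state."""
        return all(self.enabled(s, a) for s in self.states for a in self.sig.inputs)

proc = Automaton(
    sig=Signature(frozenset({"init"}), frozenset({"decide"}), frozenset()),
    states=frozenset({"idle", "ready", "done"}),
    start=frozenset({"idle"}),
    trans=frozenset({
        ("idle", "init", "ready"),
        ("ready", "init", "ready"),   # inputs cannot be refused, so they are
        ("done", "init", "done"),     # given a transition in every state
        ("ready", "decide", "done"),
    }),
)
print(proc.is_input_enabled(), proc.enabled("idle", "decide"))
```

The output action decide is under the control of the automaton and is enabled only in state "ready", whereas the input action init has a transition from every state.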

The fifth component of the I/O automaton definition, the task partition $\mathit{tasks}(A)$, should be
thought of as an abstract description of “tasks,” or “threads of control,” within the automaton. This
partition is used to define fairness conditions on an execution of the automaton; roughly speaking,
the fairness conditions say that the automaton must continue, during its execution, to give fair turns 
to each of its tasks. 

An execution fragment of $A$ is either a finite sequence, $s_0, \pi_1, s_1, \pi_2, \ldots, \pi_r, s_r$, or an infinite
sequence, $s_0, \pi_1, s_1, \pi_2, \ldots, \pi_r, s_r, \ldots$, of alternating states and actions of $A$ such that $(s_k, \pi_{k+1}, s_{k+1})$
is a transition of $A$ for every $k \ge 0$. Note that if the sequence is finite, it must end with a state.
An execution fragment beginning with a start state is called an execution. The length of a finite
execution fragment $\alpha = s_0, \pi_1, s_1, \pi_2, \ldots, \pi_r, s_r$ is $r$. The set of executions of $A$ is denoted by
$\mathit{execs}(A)$. A state is said to be reachable in $A$ if it is the final state of a finite execution of $A$.

The trace of an execution $\alpha$ of $A$, denoted by $\mathit{trace}(\alpha)$, is the subsequence of $\alpha$ consisting of all
the external actions. A trace $\beta$ of $A$ is the trace of an execution of $A$. The set of traces of $A$ is
denoted by $\mathit{traces}(A)$.
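
For finite executions these definitions can be checked mechanically. In the sketch below an execution is a Python list alternating states and actions; the function names and the three-state example automaton (given only by its start states, transitions and external actions) are assumptions made for illustration.

```python
def is_execution(seq, start, trans):
    """seq = [s0, a1, s1, ..., ar, sr] starts in a start state, ends with a
    state, and every (s_k, a_{k+1}, s_{k+1}) is a transition."""
    if len(seq) % 2 == 0 or seq[0] not in start:
        return False
    return all((seq[k], seq[k + 1], seq[k + 2]) in trans
               for k in range(0, len(seq) - 2, 2))

def trace(seq, external):
    """Subsequence of the actions of seq that are external."""
    return [a for a in seq[1::2] if a in external]

start = {"idle"}
trans = {("idle", "init", "ready"),
         ("ready", "think", "ready"),   # "think" plays the role of an internal action
         ("ready", "decide", "done")}
external = {"init", "decide"}

alpha = ["idle", "init", "ready", "think", "ready", "decide", "done"]
print(is_execution(alpha, start, trans))
print(len(alpha) // 2)          # the length r of the execution
print(trace(alpha, external))   # the internal action does not appear
```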


3.2.2 Composition of IOA 


The composition operation allows an automaton representing a complex system to be constructed by 
composing automata representing simpler system components. The most important characteristic of 
the composition of automata is that properties of isolated system components still hold when those 
isolated components are composed with other components. The composition identifies actions with 
the same name in different component automata. When any component automaton performs a step 
involving $\pi$, so do all component automata that have $\pi$ in their signatures. Since internal actions of
an automaton $A$ are intended to be unobservable by any other automaton $B$, automaton $A$ cannot
be composed with automaton $B$ unless the internal actions of $A$ are disjoint from the actions of $B$.
(Otherwise, $A$’s performance of an internal action could force $B$ to take a step.) Moreover, $A$ and
$B$ cannot be composed unless the sets of output actions of $A$ and $B$ are disjoint. (Otherwise two
automata would have the control of an output action.)


Let $I$ be an arbitrary finite index set². A finite collection $\{S_i\}_{i \in I}$ of signatures is said
to be compatible if for all $i, j \in I$, $i \ne j$, the following hold³:

1. $\mathit{int}(S_i) \cap \mathit{acts}(S_j) = \emptyset$
2. $\mathit{out}(S_i) \cap \mathit{out}(S_j) = \emptyset$

A finite collection of automata is said to be compatible if their signatures are compatible.
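
The two compatibility conditions translate directly into a pairwise check. In this sketch a signature is a dictionary with keys "in", "out" and "int" holding sets of action names; that representation, the function names and the process/channel example are our assumptions.

```python
from itertools import permutations

def acts(sig):
    return sig["in"] | sig["out"] | sig["int"]

def compatible(sigs):
    """Conditions 1 and 2 for every ordered pair of distinct signatures."""
    return all(not (si["int"] & acts(sj)) and not (si["out"] & sj["out"])
               for si, sj in permutations(sigs, 2))

process = {"in": {"init", "receive"}, "out": {"send"}, "int": {"think"}}
channel = {"in": {"send"}, "out": {"receive"}, "int": set()}
rival = {"in": set(), "out": {"send"}, "int": set()}

print(compatible([process, channel]))         # outputs and internals do not clash
print(compatible([process, channel, rival]))  # two automata would control "send"
```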


² The composition operation for IOA is defined also for an infinite but countable collection of automata [65], but
we only consider the composition of a finite number of automata.

³ We remark that for the composition of an infinite countable collection of automata, there is a third condition on
the definition of compatible signature [65]. However this third condition is automatically satisfied when considering
only finite sets of automata.

When we compose a collection of automata, output actions of the components become output
actions of the composition, internal actions of the components become internal actions of the composition,
and actions that are inputs to some components but outputs of none become input actions
of the composition. Formally, the composition $S = \prod_{i \in I} S_i$ of a finite compatible collection of
signatures $\{S_i\}_{i \in I}$ is defined to be the signature with

• $\mathit{out}(S) = \bigcup_{i \in I} \mathit{out}(S_i)$
• $\mathit{int}(S) = \bigcup_{i \in I} \mathit{int}(S_i)$
• $\mathit{in}(S) = \bigcup_{i \in I} \mathit{in}(S_i) - \bigcup_{i \in I} \mathit{out}(S_i)$

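The composition $A = \prod_{i \in I} A_i$ of a finite collection of automata is defined as follows:⁴

• $\mathit{sig}(A) = \prod_{i \in I} \mathit{sig}(A_i)$
• $\mathit{states}(A) = \prod_{i \in I} \mathit{states}(A_i)$
• $\mathit{start}(A) = \prod_{i \in I} \mathit{start}(A_i)$
• $\mathit{trans}(A)$ is the set of triples $(s, \pi, s')$ such that, for all $i \in I$, if $\pi \in \mathit{acts}(A_i)$, then $(s_i, \pi, s'_i) \in \mathit{trans}(A_i)$; otherwise $s_i = s'_i$
• $\mathit{tasks}(A) = \bigcup_{i \in I} \mathit{tasks}(A_i)$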


Thus, the states and start states of the composition automaton are vectors of states and start 
states, respectively, of the component automata. The transitions of the composition are obtained by 
allowing all the component automata that have a particular action $\pi$ in their signature to participate
simultaneously in steps involving $\pi$, while all the other component automata do nothing. The
task partition of the composition’s locally controlled actions is formed by taking the union of the
components’ task partitions; that is, each equivalence class of each component automaton becomes an
equivalence class of the composition. This means that the task structure of individual components is
preserved when the components are composed. Notice that since the automata $A_i$ are input-enabled,
so is their composition. The following theorem follows from the definition of composition.

Theorem 3.2.1 The composition of a compatible collection of I/O automata is an I/O automaton.
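
The transition relation of a composition can be sketched as a function that, given a state vector and an action, returns all successor state vectors. Each component is represented here by a pair (set of actions, set of transitions); this representation, the function name and the sender/channel/receiver example are our assumptions, and for brevity the toy components are not fully input-enabled.

```python
from itertools import product

def composed_steps(components, state, action):
    """Successor state vectors of `state` under `action` in the composition."""
    choices = []
    for (acts, trans), s in zip(components, state):
        if action in acts:
            # participating components take one of their own steps
            choices.append([dst for (src, a, dst) in trans if src == s and a == action])
        else:
            # all other components do nothing
            choices.append([s])
    return [tuple(c) for c in product(*choices)]

sender = ({"send"}, {("ready", "send", "sent")})
channel = ({"send", "receive"}, {("empty", "send", "full"), ("full", "receive", "empty")})
receiver = ({"receive"}, {("waiting", "receive", "got")})
system = [sender, channel, receiver]

s0 = ("ready", "empty", "waiting")
s1 = composed_steps(system, s0, "send")[0]
print(s1)
print(composed_steps(system, s1, "receive"))
print(composed_steps(system, s0, "receive"))  # channel cannot deliver from "empty"
```

An action is a step of the composition only if every component that has it in its signature can take it, which is why the last call yields no successor.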


The following theorems relate the executions and traces of a composition to those of the component
automata. The first says that an execution or trace of a composition “projects” to yield
executions or traces of the component automata. Given an execution, $\alpha = s_0, \pi_1, s_1, \ldots$, of $A$, let
$\alpha|A_i$ be the sequence obtained by deleting each pair $\pi_r, s_r$ for which $\pi_r$ is not an action of $A_i$ and
replacing each remaining $s_r$ by $(s_r)_i$, that is, automaton $A_i$’s piece of the state $s_r$. Also, given a
trace $\beta$ of $A$ (or, more generally, any sequence of actions), let $\beta|A_i$ be the subsequence of $\beta$ consisting
of all the actions of $A_i$ in $\beta$. Also, $|$ represents the subsequence of a sequence $\beta$ of actions consisting
of all the actions in a given set in $\beta$.

⁴ The $\prod$ notation in the definition of $\mathit{start}(A)$ and $\mathit{states}(A)$ refers to the ordinary Cartesian product, while the $\prod$
notation in the definition of $\mathit{sig}(A)$ refers to the composition operation just defined, for signatures. Also, the notation
$s_i$ denotes the $i$th component of the state vector $s$.


Theorem 3.2.2 Let $\{A_i\}_{i \in I}$ be a compatible collection of automata and let $A = \prod_{i \in I} A_i$.
1. If $\alpha \in execs(A)$, then $\alpha|A_i \in execs(A_i)$ for every $i \in I$.
2. If $\beta \in traces(A)$, then $\beta|A_i \in traces(A_i)$ for every $i \in I$.


The other two are converses of Theorem 3.2.2. The next theorem says that, under certain
conditions, executions of component automata can be “pasted together” to form an execution of the
composition.


Theorem 3.2.3 Let $\{A_i\}_{i \in I}$ be a compatible collection of automata and let $A = \prod_{i \in I} A_i$. Suppose
$\alpha_i$ is an execution of $A_i$ for every $i \in I$, and suppose $\beta$ is a sequence of actions in $ext(A)$ such that
$\beta|A_i = trace(\alpha_i)$ for every $i \in I$. Then there is an execution $\alpha$ of $A$ such that $\beta = trace(\alpha)$ and
$\alpha_i = \alpha|A_i$ for every $i \in I$.


The final theorem says that traces of component automata can also be pasted together to form
a trace of the composition.


Theorem 3.2.4 Let $\{A_i\}_{i \in I}$ be a compatible collection of automata and let $A = \prod_{i \in I} A_i$. Suppose
$\beta$ is a sequence of actions in $ext(A)$. If $\beta|A_i \in traces(A_i)$ for every $i \in I$, then $\beta \in traces(A)$.


Theorem 3.2.4 implies that in order to show that a sequence is a trace of a system, it is enough
to show that its projection on each individual system component is a trace of that component.
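
The way Theorem 3.2.4 is typically used can be sketched in a few lines of Python. The sketch below is only illustrative: the two components, their action names and the predicate never_ahead are hypothetical, and each component is described directly by a membership test for its set of traces rather than by an automaton.

```python
def project(beta, actions):
    """Return beta|A_i: the subsequence of beta made of the actions in `actions`."""
    return [a for a in beta if a in actions]


def never_ahead(seq, later, earlier):
    """Toy trace predicate: in every prefix, `later` occurs at most as often as `earlier`."""
    n_later = n_earlier = 0
    for a in seq:
        if a == earlier:
            n_earlier += 1
        elif a == later:
            n_later += 1
        if n_later > n_earlier:
            return False
    return True


# Two hypothetical components: (external actions, membership test for traces).
COMPONENTS = [
    ({"req", "grant"}, lambda t: never_ahead(t, "grant", "req")),
    ({"grant", "rel"}, lambda t: never_ahead(t, "rel", "grant")),
]


def is_trace_of_composition(beta):
    """Check suggested by Theorem 3.2.4: every projection is a trace of its component."""
    return all(is_trace(project(beta, acts)) for acts, is_trace in COMPONENTS)


good = is_trace_of_composition(["req", "grant", "rel"])
bad = is_trace_of_composition(["req", "rel", "grant"])
```

The second sequence is rejected because its projection on the second component starts with "rel", which that component cannot produce before a "grant".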
Chapter 4


The VS service


In this chapter we describe the view-oriented group communication service VS introduced in [41]. The
VS service deals with views. We also describe “variations” of the service, which deal with configurations.


4.1 The VS service 


The VS service is a view-oriented group communication service. The name VS stands for “view
synchrony” and refers to the property that a message sent within a particular view is delivered
only to members of that view and only if they are still in that view. This seems to be the most
important property of group communication services that go under the label of “virtual synchrony”
(the expression “virtual synchrony” has become semantically overloaded, and different virtually
synchronous services provide different guarantees).

Another important feature of the VS specification is that it requires the sequences of messages
received by any two processes within a given view to be such that one is a prefix of the other.
Finally, new views are reported to their members in order of view identifier.

The external actions of the VS specification include VS-GPSND$(m)_p$ actions, representing the client
at $p$ sending a message $m$, and VS-GPRCV$(m)_{p,q}$ actions, representing the delivery to $q$ of the message
$m$ sent by $p$. Output actions VS-SAFE$(m)_{p,q}$ are also provided at $q$ to report that the earlier message
$m$ from $p$ has been delivered to all locations in the current view as known by $q$.

The VS service informs its clients of group status changes through VS-NEWVIEW$((g, P))_p$ actions
(with $p \in P$), which tell $p$ that the view identifier $g$ is associated with membership set $P$ and
that, until another VS-NEWVIEW occurs, the following messages will be in this view. After any finite
execution, the current view at $p$ is defined as the argument $v$ in the last VS-NEWVIEW$(v)_p$ event, if any;
otherwise it is either the initial view $(g_0, P_0)$ if $p \in P_0$, or $\bot$ if $p \notin P_0$. This reflects the concept that
the system starts with the processors in $P_0$ forming the group, and other processors unaware of the
group.
VS

Signature:

Input:    VS-GPSND(m)_p, m ∈ M, p ∈ P
Internal: VS-CREATEVIEW(v), v ∈ V
          VS-ORDER(m, p, g), m ∈ M, p ∈ P, g ∈ G
Output:   VS-GPRCV(m)_{p,q}, m ∈ M, p, q ∈ P
          VS-SAFE(m)_{p,q}, m ∈ M, p, q ∈ P
          VS-NEWVIEW(v)_p, v ∈ V, p ∈ v.set

State:

created ∈ 2^V, init {v_0}
for each p ∈ P:
    current-viewid[p] ∈ G_⊥, init g_0 if p ∈ P_0, ⊥ else
for each g ∈ G:
    queue[g] ∈ seqof(M × P), init λ
for each p ∈ P, g ∈ G:
    pending[p, g] ∈ seqof(M), init λ
    next[p, g] ∈ N^{>0}, init 1
    next-safe[p, g] ∈ N^{>0}, init 1

Transitions:

internal VS-CREATEVIEW(v)
    Pre: ∀w ∈ created : v.id > w.id
    Eff: created := created ∪ {v}

output VS-NEWVIEW(v)_p
    Pre: v ∈ created
         v.id > current-viewid[p]
    Eff: current-viewid[p] := v.id

input VS-GPSND(m)_p
    Eff: if current-viewid[p] ≠ ⊥ then
             append m to pending[p, current-viewid[p]]

internal VS-ORDER(m, p, g)
    Pre: m is head of pending[p, g]
    Eff: remove head of pending[p, g]
         append (m, p) to queue[g]

output VS-GPRCV(m)_{p,q}, choose g
    Pre: g ≠ ⊥
         g = current-viewid[q]
         queue[g](next[q, g]) = (m, p)
    Eff: next[q, g] := next[q, g] + 1

output VS-SAFE(m)_{p,q}, choose g, P
    Pre: g ≠ ⊥
         g = current-viewid[q]
         (g, P) ∈ created
         queue[g](next-safe[q, g]) = (m, p)
         for all r ∈ P:
             next[r, g] > next-safe[q, g]
    Eff: next-safe[q, g] := next-safe[q, g] + 1


Figure 4-1: The VS service
The code is given in Figure 4-1.

The state of the VS service keeps track of the created views in the variable $\mathit{created}$ and, for each
processor $p$, of the identifier of the current view at $p$ in the variable $\mathit{current\text{-}viewid}[p]$. For each view,
incoming messages from the client at $p$ are first buffered in a queue for processor $p$, $\mathit{pending}[p, g]$,
and then they are “ordered” into a global queue for the view, $\mathit{queue}[g]$. The pointers $\mathit{next}[p, g]$ and
$\mathit{next\text{-}safe}[p, g]$ point to, respectively, the next message of the global queue that has to be delivered
to the client at $p$ and the next safe indication that has to be delivered to the client at $p$. In any
trace of the VS service, there is a natural correspondence between VS-GPRCV events and the VS-GPSND
events that cause them, and between VS-SAFE events and the VS-GPSND events that cause them.

The actions for creating a view and for informing a processor of a new view are straightforward 
(recall that the signature ensures that only members, but not necessarily all members, receive 
notification of a new view). 

A message that is sent before the sender knows of any view (when the current view is $\bot$) is
simply ignored, and never delivered anywhere.
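
To make the roles of the state variables concrete, the following Python sketch mimics the transitions of Figure 4-1 as methods whose assertions play the role of preconditions. It is only an informal rendering under stated assumptions: views are modelled as pairs (id, frozenset of members), $\bot$ is modelled by None and is assumed to be smaller than every view identifier, and the class and method names are hypothetical. Unlike the specification, gprcv and safe return the pair (m, p) determined by the state instead of taking it as a parameter.

```python
from collections import defaultdict

BOTTOM = None  # models the undefined view identifier


class VS:
    def __init__(self, v0):
        g0, members0 = v0
        self.created = {v0}
        self.current_viewid = defaultdict(lambda: BOTTOM)
        for p in members0:
            self.current_viewid[p] = g0
        self.pending = defaultdict(list)          # (p, g) -> buffered messages
        self.queue = defaultdict(list)            # g -> ordered (m, p) pairs
        self.next_msg = defaultdict(lambda: 1)    # (p, g) -> 1-based index
        self.next_safe = defaultdict(lambda: 1)   # (p, g) -> 1-based index

    def createview(self, v):
        assert all(v[0] > w[0] for w in self.created)
        self.created.add(v)

    def newview(self, v, p):
        g, members = v
        cur = self.current_viewid[p]
        assert v in self.created and p in members
        assert cur is BOTTOM or g > cur
        self.current_viewid[p] = g

    def gpsnd(self, m, p):
        g = self.current_viewid[p]
        if g is not BOTTOM:                       # otherwise the message is ignored
            self.pending[(p, g)].append(m)

    def order(self, m, p, g):
        assert self.pending[(p, g)] and self.pending[(p, g)][0] == m
        self.pending[(p, g)].pop(0)
        self.queue[g].append((m, p))

    def gprcv(self, q):
        g = self.current_viewid[q]
        assert g is not BOTTOM
        i = self.next_msg[(q, g)]
        assert i <= len(self.queue[g])
        self.next_msg[(q, g)] = i + 1
        return self.queue[g][i - 1]

    def safe(self, q):
        g = self.current_viewid[q]
        assert g is not BOTTOM
        members = next(P for (h, P) in self.created if h == g)
        i = self.next_safe[(q, g)]
        assert i <= len(self.queue[g])
        assert all(self.next_msg[(r, g)] > i for r in members)
        self.next_safe[(q, g)] = i + 1
        return self.queue[g][i - 1]


vs = VS((0, frozenset({"a", "b"})))
vs.gpsnd("lost", "c")          # c knows no view: the message is dropped
vs.gpsnd("hello", "a")
vs.order("hello", "a", 0)
first = vs.gprcv("b")
vs.gprcv("a")                  # only now can the message become safe
safe_msg = vs.safe("b")
```

The safe indication at b is enabled only after both members of the view have received the message, which is what the last precondition of VS-SAFE expresses.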

Note that the VS specification does not include any restrictions on when a new view might be formed.
Clearly it is possible to analyze the service conditionally on some restrictions on the execution. For
example, the performance and fault-tolerance analysis provided in [41] does consider some
restrictions: it implies that “capricious” view changes must stop shortly after the behavior of the
underlying physical system stabilizes.

We note that the fact that VS allows views to be created only in order of view identifier is
unimportant: weakening this requirement to allow out-of-order view creation would not change the
external behavior, because VS-NEWVIEW actions are constrained to occur in such a way that views are
delivered in order of view identifier anyway.


The following are safety properties of the VS service which we will be using in Chapter 5.

• New views are reported in increasing order of view identifier (Monotone views property);
• Messages sent in a view are delivered only within that view (View synchrony property);
• The sequences of messages delivered in a view at any two processors are such that one sequence
  is a prefix of the other (Prefix order property).

The following invariant holds.


Invariant 4.1.1 (VS)
In any reachable state, if $v, v' \in \mathit{created}$ and $v.id = v'.id$, then $v = v'$.
4.2 Variations: The CS and LC services


In many applications involving shared data, updates to the shared data have to be agreed upon by 
all the members of a view. In order to improve availability of the service and to balance the load of 
the system it is desirable to make updates without involving all the members of a view while still 
preserving consistency. This is achieved by using configurations. 

A configuration is different from an ordinary view in that it is an ordinary view equipped with a 
set of subsets of the members of the view which satisfy the property that any two sets intersect. Such 
sets are called quorums. Hence a configuration is a view plus a set of quorums. The intersection 
property of quorums guarantees consistency within a given configuration: indeed for any given 
quorum there is always at least one process that has the latest update. 

We will consider two more specialized types of configurations, which have been introduced in
Chapter 3.

One such configuration is the read-write-quorum configuration. Recalling the definition from
Chapter 3, a configuration is $c = (g, P, R, W)$, where $g$ is a configuration identifier, $P$
is the set of members, and $R$ and $W$ are the sets of read and write quorums.

Another such configuration is the leader configuration. Recalling the definition from Chapter 3,
in this case a configuration is $c = (g, P, Q, p)$, where $g$ is a configuration identifier, $P$
is the set of members, $Q$ is the set of quorums, and $p$ is the leader of the configuration.
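
The intersection requirements on quorums are easy to check mechanically. The Python sketch below is an illustration only: the function names are hypothetical, majorities is just one well-known way of obtaining pairwise-intersecting quorums, and the read-write condition encoded here (every read quorum intersects every write quorum) is the usual one, which we assume matches the definition recalled from Chapter 3.

```python
from itertools import combinations


def pairwise_intersect(quorums):
    """True if any two quorums in the collection share at least one member."""
    return all(a & b for a, b in combinations(quorums, 2))


def read_write_intersect(read_quorums, write_quorums):
    """Assumed read-write condition: each read quorum meets each write quorum."""
    return all(r & w for r in read_quorums for w in write_quorums)


def majorities(members):
    """All subsets containing a strict majority of `members` of minimal size."""
    k = len(members) // 2 + 1
    return [frozenset(c) for c in combinations(sorted(members), k)]


members = frozenset({"a", "b", "c"})
quorums = majorities(members)
ok = pairwise_intersect(quorums)
not_ok = pairwise_intersect([frozenset({"a"}), frozenset({"b", "c"})])
rw_ok = read_write_intersect(
    [frozenset({"a", "b"})],
    [frozenset({"b", "c"}), frozenset({"a", "c"})],
)
```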

The VS service supports ordinary views $v = (g, P)$ but can be easily generalized to configurations.
We call these generalizations CS and LC, respectively, for the read-write-quorum configurations and
for the leader configurations. The only difference between CS and VS, as well as between LC and VS, is that
CS and LC announce configurations while VS announces ordinary views. The code of CS, as well as
that of LC, is exactly the code of VS: configurations are treated as single entities, as are
ordinary views. We “rename” the code because the services are actually different
(one provides views and the others provide configurations), so we must distinguish them.

Figure 4-2 shows the code of the CS and LC specifications. Since this code is the same as that of VS, all
the properties and invariants of VS apply to CS and LC too. In particular, configurations
are reported in increasing order of configuration identifier (Monotone configurations property),
messages sent in a configuration are delivered only within that configuration (Configuration synchrony
property), and the sequences of messages delivered in a configuration at any two processes are such
that one is a prefix of the other.
CS and LC

Signature:

Input:    CS-GPSND(m)_p, m ∈ M, p ∈ P
Internal: CS-CREATECONF(c), c ∈ C
          CS-ORDER(m, p, g), m ∈ M, p ∈ P, g ∈ G
Output:   CS-GPRCV(m)_{p,q}, m ∈ M, p, q ∈ P
          CS-SAFE(m)_{p,q}, m ∈ M, p, q ∈ P
          CS-NEWCONF(c)_p, c ∈ C, p ∈ c.set

State:

created ∈ 2^C, init {c_0}
for each p ∈ P:
    current-confid[p] ∈ G_⊥, init g_0 if p ∈ P_0, ⊥ else
for each g ∈ G:
    queue[g] ∈ seqof(M × P), init λ
for each p ∈ P, g ∈ G:
    pending[p, g] ∈ seqof(M), init λ
    next[p, g] ∈ N^{>0}, init 1
    next-safe[p, g] ∈ N^{>0}, init 1

Transitions:

internal CS-CREATECONF(c)
    Pre: ∀w ∈ created : c.id > w.id
    Eff: created := created ∪ {c}

output CS-NEWCONF(c)_p
    Pre: c ∈ created
         c.id > current-confid[p]
    Eff: current-confid[p] := c.id

input CS-GPSND(m)_p
    Eff: if current-confid[p] ≠ ⊥ then
             append m to pending[p, current-confid[p]]

internal CS-ORDER(m, p, g)
    Pre: m is head of pending[p, g]
    Eff: remove head of pending[p, g]
         append (m, p) to queue[g]

output CS-GPRCV(m)_{p,q}, choose g
    Pre: g = current-confid[q]
         g ≠ ⊥
         queue[g](next[q, g]) = (m, p)
    Eff: next[q, g] := next[q, g] + 1

output CS-SAFE(m)_{p,q}, choose g, P
    Pre: g = current-confid[q]
         g ≠ ⊥
         (g, P) ∈ created
         queue[g](next-safe[q, g]) = (m, p)
         for all r ∈ P:
             next[r, g] > next-safe[q, g]
    Eff: next-safe[q, g] := next-safe[q, g] + 1


Figure 4-2: The CS service
Chapter 5


The DVS service


In this chapter we present DVS, a specification for a dynamic primary view group communication
service. Section 5.1 provides the DVS specification, Section 5.2 provides an implementation of DVS,
and Section 5.3 describes an application that uses DVS as a building block. Section 5.4 closes
the chapter with some remarks.


5.1 The DVS specification


The DVS service works as follows. Each client of the service has a “current” view of the group of 
processes. A process can send a message to all other members of its current view and the service 
guarantees that messages sent within a view are delivered only within that view and each member of 
the view receives messages in the same order as other members. However, not all messages need to 
be delivered to all members. The service also provides a “safe” notification for a particular message 
m that tells the recipient that message m has been received by all the members of the current view. 
New views are announced to all members of the new view and new views are guaranteed to be 
“primary” views. Primary views are defined according to a dynamic notion [55]: a new primary 
needs to contain a majority of the members of the previous primary. The DVS service allows the
clients to “register” a new view after completing the pre-processing for that view.

The specification is given in Figure 5-1. In this specification, $M_c \subseteq M$ denotes the set of messages
that clients may use for communication. The most interesting part of the DVS specification is the
transition definition for DVS-CREATEVIEW$(v)$. The precondition specifies the properties that a view
must satisfy in order to be considered primary. The precondition says that $v.set$ must intersect the
membership set of every previously-created smaller-id view $w$ for which there is no intervening totally
registered view; these views form the set of all “possible previous primary views”. Since (for convenience) we
allow out-of-order view creation in DVS, we also include a symmetric condition for previously-created
larger-id views. All created views are recorded in $\mathit{created}$.
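
The precondition of DVS-CREATEVIEW described above can be spelled out as a small predicate. The following Python sketch is an informal illustration only; views are modelled as pairs (id, frozenset of members), and the name can_create is hypothetical.

```python
def can_create(v, created, tot_reg):
    """Precondition of DVS-CREATEVIEW(v): the identifier is fresh and, for every
    created view w, either some totally registered view lies strictly between
    v and w in identifier order, or v and w have a common member."""
    vid, vset = v
    if any(wid == vid for wid, _ in created):
        return False
    for wid, wset in created:
        lo, hi = min(vid, wid), max(vid, wid)
        separated = any(lo < xid < hi for xid, _ in tot_reg)
        if not separated and not (vset & wset):
            return False
    return True


v0 = (0, frozenset({"a", "b", "c"}))
v1 = (1, frozenset({"a", "b"}))

# Only v0 exists and it is totally registered.
r1 = can_create(v1, {v0}, {v0})
r2 = can_create((1, frozenset({"d", "e"})), {v0}, {v0})

# v1 has been created but is not yet totally registered: a new view must
# intersect both v0 and v1.
r3 = can_create((2, frozenset({"c", "d"})), {v0, v1}, {v0})
r4 = can_create((2, frozenset({"b", "c"})), {v0, v1}, {v0})

# Once v1 is totally registered, v0 no longer constrains views with id above 1.
r5 = can_create((2, frozenset({"b", "d"})), {v0, v1}, {v0, v1})
```

In the third call the candidate view intersects $v_0$ but not $v_1$, so it is rejected; in the last call $v_1$ separates the candidate from $v_0$, so only the intersection with $v_1$ matters.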
DVS

Signature:

Input:    DVS-GPSND(m)_p, m ∈ M_c, p ∈ P
          DVS-REGISTER_p, p ∈ P
Internal: DVS-CREATEVIEW(v), v ∈ V
          DVS-ORDER(m, p, g), m ∈ M_c, p ∈ P, g ∈ G
Output:   DVS-GPRCV(m)_{p,q}, m ∈ M_c, p, q ∈ P
          DVS-SAFE(m)_{p,q}, m ∈ M_c, p, q ∈ P
          DVS-NEWVIEW(v)_p, v ∈ V, p ∈ v.set

State:

created ∈ 2^V, init {v_0}
for each p ∈ P:
    current-viewid[p] ∈ G_⊥, init g_0 if p ∈ P_0, ⊥ else
for each g ∈ G:
    queue[g] ∈ seqof(M_c × P), init λ
    attempted[g] ∈ 2^P, init P_0 if g = g_0, ∅ else
    registered[g] ∈ 2^P, init P_0 if g = g_0, ∅ else
for each p ∈ P, g ∈ G:
    pending[p, g] ∈ seqof(M_c), init λ
    next[p, g] ∈ N^{>0}, init 1
    next-safe[p, g] ∈ N^{>0}, init 1

Derived variables:

Att ∈ 2^V, defined as {v ∈ created | attempted[v.id] ≠ ∅}
TotAtt ∈ 2^V, defined as {v ∈ created | v.set ⊆ attempted[v.id]}
Reg ∈ 2^V, defined as {v ∈ created | registered[v.id] ≠ ∅}
TotReg ∈ 2^V, defined as {v ∈ created | v.set ⊆ registered[v.id]}

Transitions:

internal DVS-CREATEVIEW(v)
    Pre: ∀w ∈ created : v.id ≠ w.id
         ∀w ∈ created :
             ∃x ∈ TotReg : w.id < x.id < v.id
             or ∃x ∈ TotReg : v.id < x.id < w.id
             or v.set ∩ w.set ≠ ∅
    Eff: created := created ∪ {v}

output DVS-NEWVIEW(v)_p
    Pre: v ∈ created
         v.id > current-viewid[p]
    Eff: current-viewid[p] := v.id
         attempted[v.id] := attempted[v.id] ∪ {p}

input DVS-REGISTER_p
    Eff: if current-viewid[p] ≠ ⊥ then
             registered[current-viewid[p]] := registered[current-viewid[p]] ∪ {p}

input DVS-GPSND(m)_p
    Eff: if current-viewid[p] ≠ ⊥ then
             append m to pending[p, current-viewid[p]]

internal DVS-ORDER(m, p, g)
    Pre: m is head of pending[p, g]
    Eff: remove head of pending[p, g]
         append (m, p) to queue[g]

output DVS-GPRCV(m)_{p,q}, choose g
    Pre: g = current-viewid[q]
         queue[g](next[q, g]) = (m, p)
    Eff: next[q, g] := next[q, g] + 1

output DVS-SAFE(m)_{p,q}, choose g, P
    Pre: g = current-viewid[q]
         (g, P) ∈ created
         queue[g](next-safe[q, g]) = (m, p)
         for all r ∈ P:
             next[r, g] > next-safe[q, g]
    Eff: next-safe[q, g] := next-safe[q, g] + 1


Figure 5-1: The DVS service.
The DVS service informs its clients of view changes using DVS-NEWVIEW$((g, P))_p$ actions; such an
action informs processor $p$ that the view identifier $g$ is associated with membership set $P$ and that
the current group of processors connected to $p$ is $P$. After any finite execution, we define the current
view at $p$ to be the argument $v$ in the last DVS-NEWVIEW$(v)_p$ event, if any; otherwise it is the initial view
$v_0$ for processors in $P_0$ and is undefined for other processors. Even though views can be created out
of view identifier order, the notification to each client is consistent with that order. Not every client
needs to see every view. The variable $\mathit{attempted}$ records, for each view, which processes have been
notified of that view. Variable $\mathit{attempted}$ is only used in proving the correctness of an implementation
of DVS.

With the DVS-REGISTER$_p$ action, the client at $p$ informs the service that it has obtained whatever
information the application needs to begin operating in the new view $v$. For many applications, this
will mean that $p$ has received messages from every other member of view $v$, reporting its state at the
start of $v$. The variable $\mathit{registered}$ records, for each view, which processes have registered that view.
Variable $\mathit{registered}$ is only used in proving the correctness of an implementation of DVS.

The DVS service allows a processor $p$ to broadcast a message $m$ using a DVS-GPSND$(m)_p$ action, and
delivers the message to a processor $q$ using a DVS-GPRCV$(m)_{p,q}$ action. DVS also uses a DVS-SAFE$(m)_{p,q}$
action to report to processor $q$ that the earlier message $m$ from $p$ has been delivered to all members
of the current view of $q$. DVS guarantees that messages sent by a processor $p$ when the current view
of $p$ is $v$ are delivered only within view $v$ (i.e., only to processors in $v.set$ whose current view is $v$).
Moreover, each processor receives messages in the same order as the other processors and without gaps
in the sequence of received messages; however, a processor may receive only a prefix of the sequence
of messages received by another processor. Variables $\mathit{queue}$, $\mathit{pending}$, $\mathit{next}$ and $\mathit{next\text{-}safe}$ are used for
handling the messages. Their use should be clear from the code.

There are four derived variables, $\mathit{Att}$, $\mathit{TotAtt}$, $\mathit{Reg}$ and $\mathit{TotReg}$. Informally, a view belongs to
the set $\mathit{Att}$ if it has been reported to at least one member of the view (we say that it is attempted).
A view belongs to the set $\mathit{TotAtt}$ if it has been reported to all members of the view (we say that
the view is totally attempted). Similarly, a view belongs to the set $\mathit{Reg}$ if at least one member of
the view has registered the view (we say that it is registered) and belongs to the set $\mathit{TotReg}$ if all
members of the view have registered the view (we say that the view is totally registered).
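
The four derived variables can be computed directly from the state. The Python sketch below is an informal illustration with hypothetical names: views are pairs (id, frozenset of members), and the variables attempted and registered are modelled as dictionaries from view identifiers to sets of processors. The state used in the example is hand-built rather than obtained from an execution.

```python
def derived(created, attempted, registered):
    """Return (Att, TotAtt, Reg, TotReg) as defined in Figure 5-1."""
    att = {v for v in created if attempted.get(v[0], set())}
    tot_att = {v for v in created if v[1] <= attempted.get(v[0], set())}
    reg = {v for v in created if registered.get(v[0], set())}
    tot_reg = {v for v in created if v[1] <= registered.get(v[0], set())}
    return att, tot_att, reg, tot_reg


v0 = (0, frozenset({"a", "b"}))
v1 = (1, frozenset({"a", "b", "c"}))

# v0 is the initial view; v1 has been reported only to a, and nobody has
# registered it yet.
created = {v0, v1}
attempted = {0: {"a", "b"}, 1: {"a"}}
registered = {0: {"a", "b"}, 1: set()}

att, tot_att, reg, tot_reg = derived(created, attempted, registered)
```

In this state $v_1$ is attempted but neither totally attempted nor registered, while $v_0$ belongs to all four sets.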


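We close this section with some invariants stating properties of DVS.


Invariant 5.1.1 (DVS)
In any reachable state, $\mathit{TotAtt} \subseteq \mathit{Att}$, $\mathit{TotReg} \subseteq \mathit{Reg}$, $\mathit{Reg} \subseteq \mathit{Att}$, and $\mathit{TotReg} \subseteq \mathit{TotAtt}$.


Invariant 5.1.2 (DVS)
In any reachable state, if $p \in \mathit{attempted}[g]$ then $\mathit{current\text{-}viewid}[p] \ge g$.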
Invariant 5.1.3 expresses the key intersection property guaranteed by DVS; this is weaker than
the intersection property required by static definitions of primary views, which says that all primary
components must intersect. This invariant is our version of the correctness requirement for dynamic
view services that two consecutive primary views intersect.


Invariant 5.1.3 (DVS)
In any reachable state, if $v, w \in \mathit{created}$, $v.id < w.id$, and there is no $x \in \mathit{TotReg}$ such that
$v.id < x.id < w.id$, then $v.set \cap w.set \neq \emptyset$.


Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. In the initial state $\mathit{created} = \{v_0\}$ and thus the invariant is
vacuously true.

For the inductive step assume the invariant is true in $s$. We need to prove that it is true in $s'$
for any possible step $(s, \pi, s')$. The only steps that can change the hypothesis from false to true are
DVS-CREATEVIEW$(v)$ and DVS-CREATEVIEW$(w)$. The preconditions of these actions show that the needed
conclusion holds. No step changes the conclusion from true to false. □


Invariant 5.1.4 says that if a view $w$ is totally attempted, then any earlier view $v$ has a member
whose current view is later than $v$.


Invariant 5.1.4 (DVS)
In any reachable state, if $v \in \mathit{created}$, $w \in \mathit{TotAtt}$, and $v.id < w.id$, then there exists $p \in v.set$ with
$\mathit{current\text{-}viewid}[p] > v.id$.


Proof: Consider any particular reachable state. Assume that $v \in \mathit{created}$, $w \in \mathit{TotAtt}$, and $v.id < w.id$.
Let $y$ be the view in $\mathit{TotAtt}$ having the smallest view identifier strictly greater than $v.id$. Then
there is no $x \in \mathit{TotAtt}$ with $v.id < x.id < y.id$. Then Invariant 5.1.1 implies that there is no
$x \in \mathit{TotReg}$ with $v.id < x.id < y.id$. Then Invariant 5.1.3 implies that $v.set \cap y.set \neq \emptyset$. Let
$p \in v.set \cap y.set$; since $y$ is totally attempted, $p \in \mathit{attempted}[y.id]$. Then Invariant 5.1.2 implies that
$\mathit{current\text{-}viewid}[p] \ge y.id$. Since $y.id > v.id$, this implies $\mathit{current\text{-}viewid}[p] > v.id$. □


5.2 An implementation of DVS


We now give an implementation of the DVS service, which we call DVS-IMPL. In Section 5.2.1 we
describe DVS-IMPL, in Section 5.2.2 we provide some invariants of DVS-IMPL, and finally in Section 5.2.3
we prove that DVS-IMPL implements DVS, in the sense of inclusion of sets of traces.
5.2.1 Overview


The implementation uses as a building block the group communication service VS (see Chapter 4)
and uses ideas from [88]. The overall system is the composition of an automaton VS-TO-DVS$_p$ for
each $p \in P$, and VS, with the external actions of VS hidden in the composition. This system is called
DVS-IMPL and is illustrated in Figure 5-2.


Figure 5-2: The DVS-IMPL system.


The automaton VS-TO-DVS$_p$ is given in Figure 5-3. VS-TO-DVS$_p$ uses special non-client messages,
tagged either with “info” or “registered”. Thus, we use $M = M_c \cup (\{\text{“info”}\} \times V \times 2^V) \cup \{\text{“registered”}\}$,
where $M_c$ is the set of all client messages and $M$ is the universe of all messages. The $\mathit{attempted}$, $\mathit{reg}$,
and $\mathit{info\text{-}sent}$ state variables are not needed for the algorithm, but only for the correctness proof.

Automaton VS-TO-DVS$_p$ acts as a “filter”, receiving VS-NEWVIEW inputs from the underlying VS
service and deciding whether to accept the proposed views as primary views. If VS-TO-DVS$_p$ decides
to accept some such view $v$, it “attempts” the view by performing a DVS-NEWVIEW$(v)$ output. For
each $v$, we think of the DVS internal DVS-CREATEVIEW$(v)$ action as occurring at the time of the first
DVS-NEWVIEW$(v)$ event.

According to the DVS specification, the algorithm is supposed to guarantee nonempty intersection
of each newly-created primary view $v$ with any previously-created view $w$ having no intervening
totally registered view, which is a global condition involving nonempty intersection. The VS-TO-DVS$_p$
processors, however, do not have accurate knowledge of which primary views have been created by other
processors, nor of which views are totally registered. Therefore, the processors employ a local check
of majority intersection with known views, rather than a global check of nonempty intersection with
existing views. Specifically, each VS-TO-DVS$_p$ keeps track of an “active” view $\mathit{act}$, which is the latest
view that it knows to be totally registered, plus a set of “ambiguous” views $\mathit{amb}$, which are all the
views that it knows have been attempted (i.e., have had a DVS-NEWVIEW action performed someplace)
and whose ids are greater than $\mathit{act}.id$. We define $\mathit{use} = \{\mathit{act}\} \cup \mathit{amb}$. When VS-TO-DVS$_p$ receives a
VS-NEWVIEW$(v)$ input, it sends out “info” messages containing its current $\mathit{act}$ and $\mathit{amb}$ values to all the
other processors in the new view, using the VS service, and then waits to receive corresponding “info”
messages for view $v$ from all the other processors in the view. After receiving this information (and
VS-TO-DVS_p

Signature:

Input:    DVS-GPSND(m)_p, m ∈ M_c
          DVS-REGISTER_p
          VS-NEWVIEW(v)_p, v ∈ V, p ∈ v.set
          VS-GPRCV(m)_{q,p}, m ∈ M, q ∈ P
          VS-SAFE(m)_{q,p}, m ∈ M, q ∈ P
Internal: DVS-GARBAGE-COLLECT(v)_p, v ∈ V
Output:   VS-GPSND(m)_p, m ∈ M
          DVS-NEWVIEW(v)_p, v ∈ V, p ∈ v.set
          DVS-GPRCV(m)_{q,p}, m ∈ M_c, q ∈ P
          DVS-SAFE(m)_{q,p}, m ∈ M_c, q ∈ P

State:

cur ∈ V_⊥, init v_0 if p ∈ P_0, ⊥ else
client-cur ∈ V_⊥, init v_0 if p ∈ P_0, ⊥ else
act ∈ V, init v_0
amb ∈ 2^V, init ∅
attempted ∈ 2^V, init {v_0} if p ∈ P_0, ∅ else
for each g ∈ G, q ∈ P:
    info-rcvd[q, g] ∈ (V × 2^V)_⊥, init ⊥
    rcvd-rgst[q, g] a bool, init false
for each g ∈ G:
    msgs-to-vs[g] ∈ seqof(M), init λ
    msgs-from-vs[g] ∈ seqof(M_c × P), init λ
    safe-from-vs[g] ∈ seqof(M_c × P), init λ
    reg[g] a bool, init true if p ∈ P_0 and g = g_0, false else
    info-sent[g] ∈ (V × 2^V)_⊥, init ⊥

Derived variables:

Att ∈ 2^V, defined as Att = {v ∈ created | (∃p ∈ v.set) v ∈ attempted_p};
Reg ∈ 2^V, defined as Reg = {v ∈ created | (∃p ∈ v.set) reg[v.id]_p = true};
TotAtt ∈ 2^V, defined as TotAtt = {v ∈ created | (∀p ∈ v.set) v ∈ attempted_p};
TotReg ∈ 2^V, defined as TotReg = {v ∈ created | (∀p ∈ v.set) reg[v.id]_p = true}.
use ∈ 2^V, defined as use = {act} ∪ amb

Transitions:

input VS-NEWVIEW(v)_p
    Eff: cur := v
         append (“info”, act, amb) to msgs-to-vs[cur.id]
         info-sent[cur.id] := (act, amb)

input VS-GPRCV((“info”, v, V))_{q,p}
    Eff: info-rcvd[q, cur.id] := (v, V)
         if v.id > act.id then act := v
         amb := {w ∈ amb ∪ V | w.id > act.id}

input VS-SAFE((“info”, v, V))_{q,p}
    Eff: none

output DVS-NEWVIEW(v)_p
    Pre: v = cur
         v.id > client-cur.id
         ∀q ∈ v.set, q ≠ p : info-rcvd[q, v.id] ≠ ⊥
         ∀w ∈ use : |v.set ∩ w.set| > |w.set|/2
    Eff: amb := amb ∪ {v}
         attempted := attempted ∪ {v}
         client-cur := v

input DVS-REGISTER_p
    Eff: if client-cur ≠ ⊥ then
             reg[client-cur] := true
             append (“registered”) to msgs-to-vs[client-cur.id]

input VS-GPRCV((“registered”))_{q,p}
    Eff: rcvd-rgst[cur.id, q] := true

input VS-SAFE((“registered”))_{q,p}
    Eff: none

internal DVS-GARBAGE-COLLECT(v)_p
    Pre: ∀q ∈ v.set : rcvd-rgst[q, v.id] = true
         v.id > act.id
    Eff: act := v
         amb := {w ∈ amb | w.id > act.id}

input DVS-GPSND(m)_p
    Eff: if client-cur.id ≠ ⊥ then
             append m to msgs-to-vs[client-cur.id]

output VS-GPSND(m)_p
    Pre: m is head of msgs-to-vs[cur.id]
    Eff: remove head of msgs-to-vs[cur.id]

input VS-GPRCV(m)_{q,p}, where m ∈ M_c
    Eff: append (m, q) to msgs-from-vs[cur.id]

output DVS-GPRCV(m)_{q,p}
    Pre: (m, q) is head of msgs-from-vs[client-cur.id]
    Eff: remove head of msgs-from-vs[client-cur.id]

input VS-SAFE(m)_{q,p}, where m ∈ M_c
    Eff: append (m, q) to safe-from-vs[cur.id]

output DVS-SAFE(m)_{q,p}
    Pre: (m, q) is head of safe-from-vs[client-cur.id]
    Eff: remove head of safe-from-vs[client-cur.id]


Figure 5-3: The VS-TO-DVS_p code.
updating its own $\mathit{act}$ and $\mathit{amb}$ accordingly), VS-TO-DVS$_p$ checks that $v$ has a majority intersection
with each view in $\mathit{use}$. If so, VS-TO-DVS$_p$ performs a DVS-NEWVIEW$(v)_p$ output.

Then the clients can use the communication system to exchange state information as needed for
processing in view $v$. When the client at $p$ has obtained enough information, it “registers” the view by
means of action DVS-REGISTER$_p$, which causes processor $p$ to send “registered” messages to the other
members. When a processor receives “registered” messages for a view $v$ from all members, it may
perform garbage collection by discarding information about views with ids smaller than that of $v$.
VS-TO-DVS$_p$ uses VS to send and receive messages.

The system DVS-IMPL is defined as the composition of all the VS-TO-DVS$_p$ automata and VS, with all
the external actions of VS hidden.

There are four derived variables for DVS-IMPL analogous to those of DVS, indicating the
attempted, totally attempted, registered, and totally registered views, respectively. Another derived
variable, $\mathit{use}$, is defined in the code.
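
The local bookkeeping on $\mathit{act}$ and $\mathit{amb}$ and the majority test can be illustrated with a short Python sketch. It covers only two fragments of Figure 5-3, namely the effect of receiving an “info” message and the last precondition of DVS-NEWVIEW; the other preconditions and all message handling are omitted. Views are modelled as pairs (id, frozenset of members), and the function names are hypothetical.

```python
def merge_info(act, amb, info_act, info_amb):
    """Effect of receiving an "info" message carrying (info_act, info_amb):
    adopt the later active view and keep only ambiguous views above it."""
    if info_act[0] > act[0]:
        act = info_act
    amb = {w for w in amb | info_amb if w[0] > act[0]}
    return act, amb


def majority_ok(v, act, amb):
    """Local check: v contains a strict majority of every view in use."""
    use = {act} | amb
    return all(2 * len(v[1] & w[1]) > len(w[1]) for w in use)


v0 = (0, frozenset({"a", "b", "c"}))
w1 = (1, frozenset({"a", "b", "c", "d", "e"}))   # attempted somewhere, not known registered

# Processor p starts with act = v0 and learns about w1 from an "info" message.
act, amb = merge_info(v0, set(), v0, {w1})

small = (2, frozenset({"a", "b"}))        # majority of v0 but not of w1
large = (2, frozenset({"a", "b", "d"}))   # majority of both v0 and w1
accept_small = majority_ok(small, act, amb)
accept_large = majority_ok(large, act, amb)
```

Since any two strict majorities of the same view $w$ share a member, two views that both pass this test against $w$ necessarily intersect, which is how the local test relates to the nonempty intersection required by DVS.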


5.2.2 Invariants of DVS-IMPL 


This section contains invariants of DVS-IMPL needed for the proof that DVS-IMPL implements DVS in
Section 5.2.3. The first invariants state simple facts about DVS.


Invariant 5.2.1 (DVS-IMPL)
In any reachable state, if $\mathit{cur}_p \neq \bot$ then $\mathit{current\text{-}viewid}[p] = \mathit{cur.id}_p$.


Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. Fix $p$. In the initial state, if $p \notin P_0$ then $\mathit{cur}_p = \bot$ and the
invariant is vacuously true; if $p \in P_0$ then $\mathit{cur}_p = v_0$ and $\mathit{current\text{-}viewid}[p] = g_0 = v_0.id$.
For the inductive step assume the invariant is true in $s$. We need to prove that it is true in $s'$
for any possible step $(s, \pi, s')$. Fix $p$. We prove the invariant considering each possible action $\pi$.
1. $\pi =$ VS-NEWVIEW$(v)_p$.
   By the code of $\pi$ in VS, we have that $\mathit{current\text{-}viewid}[p] = v.id$. By the code of $\pi$ in DVS-IMPL,
   we have that $\mathit{cur.id}_p = v.id$.
2. Other actions.
   Variables $\mathit{current\text{-}viewid}[p]$ and $\mathit{cur.id}_p$ are not modified. Hence the assertion cannot be made
   false.


Invariant 5.2.2 (DVS-IMPL)
In any reachable state, if $v \in \mathit{attempted}_p$ then $\mathit{client\text{-}cur.id}_p \ge v.id$.


Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. Fix $v, p$. In the initial state we have that $\mathit{attempted}_p = \{v_0\}$ for
$p \in P_0$ and $\mathit{attempted}_p = \emptyset$ for $p \notin P_0$. So assume that $v = v_0$ and $p \in P_0$. Then $\mathit{client\text{-}cur}_p = v_0$.
Hence the invariant is true.

For the inductive step assume the invariant is true in $s$. We need to prove that it is true in $s'$ for
any possible step $(s, \pi, s')$. Fix $v, p$ and assume that $v \in s'.\mathit{attempted}_p$. We distinguish two possible
cases.
1. $v \in s.\mathit{attempted}_p$.
   By the inductive hypothesis we have that $s.\mathit{client\text{-}cur.id}_p \ge v.id$. By the monotonicity of
   $\mathit{client\text{-}cur}_p$ we have that $s'.\mathit{client\text{-}cur.id}_p \ge s.\mathit{client\text{-}cur.id}_p$.
2. $v \notin s.\mathit{attempted}_p$.
   Then it must be $\pi =$ DVS-NEWVIEW$(v)_p$. The invariant follows from the code, which sets $\mathit{client\text{-}cur}_p$
   to $v$.


Invariant 5.2.3 (DVS-IMPL)
In any reachable state, if $\mathit{info\text{-}sent}[g]_p = (x, X)$ then $\mathit{cur.id}_p \ge g$.


Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. Fix $g, p$. In the initial state we have that $\mathit{info\text{-}sent}[g]_p = \bot$ and
thus the invariant is vacuously true.

For the inductive step assume the invariant is true in $s$. We need to prove that it is true in $s'$ for
any possible step $(s, \pi, s')$. Fix $p, g, x, X$ and assume that $s'.\mathit{info\text{-}sent}[g]_p = (x, X)$. We distinguish
two possible cases.
1. $s.\mathit{info\text{-}sent}[g]_p = (x, X)$.
   By the inductive hypothesis we have that $s.\mathit{cur.id}_p \ge g$. By the monotonicity of $\mathit{cur}_p$ we have
   that $s'.\mathit{cur.id}_p \ge s.\mathit{cur.id}_p$. Hence the invariant is true.
2. $s.\mathit{info\text{-}sent}[g]_p \neq (x, X)$.
   Then it must be $\pi =$ VS-NEWVIEW$(v)_p$ and $g = v.id = s'.\mathit{act.id}_p$. Action VS-NEWVIEW$(v)_p$ sets $s'.\mathit{cur}$
   to $v$, so $s'.\mathit{cur.id} = g$.


Invariant 5.2.4 (DVS-IMPL) 


In any reachable state: 
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1. vo € TotReg. 


2. go < v.id for all v € created. 


Proof: By induction on the length of the execution. The base case consists of proving that the 
invariant is true in the initial state. Part 1 is true because in then initial state every processor 
p € Po has reg[go] = true. Part 2 is true because the only view in created is vo. 

For the inductive step assume the invariant is true in s. We need to prove that it is true in s’ 
for any possible step (s,7, 8‘). 

Consider Part 1 first. No view is ever removed from JotReg. Hence no step can make the assertion 


false. Consider Part 2 now. Fix v and assume that v € s’.created. We distinguish two cases. 


1. v ∈ s.created.
Then the assertion follows from the inductive hypothesis.

2. v ∉ s.created.
It must be π = VS-CREATEVIEW(v)_p. By the precondition of this action we have that v.id > w.id
for all w ∈ s.created. By the inductive hypothesis g_0 ≤ w.id for all w ∈ s.created. Since
s'.created = s.created ∪ {v}, it follows that g_0 ≤ w.id for all w ∈ s'.created.


Invariant 5.2.5 (DVS-IMPL) 


In any reachable state, if rcvd-rgst[q, v.id]_p ≠ ⊥ then cur_p ≠ ⊥.


Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. Fix p, q and v. In the initial state we have that
rcvd-rgst[q, v.id]_p = ⊥. Hence the invariant is vacuously true.

For the inductive step assume the invariant is true in s. We need to prove that it is true in s'
for any possible step (s, π, s'). Fix p, q, v. We prove the invariant considering each possible action
π. Assume that s'.rcvd-rgst[q, v.id]_p ≠ ⊥.


1. π = VS-NEWVIEW(v)_p.
Since s'.cur_p = v we have that s'.cur_p ≠ ⊥ (VS cannot deliver ⊥, it is not a view).

2. π = VS-GPRCV(⟨"registered"⟩)_{p,q}.
By the precondition of π (see VS) we have that s.current-viewid[p] ≠ ⊥. By Invariant 5.2.1 we
have s.cur.id_p = s.current-viewid[p] ≠ ⊥. Hence s'.cur.id_p = s.cur.id_p ≠ ⊥.

3. Other actions.
Variables rcvd-rgst[q, v.id]_p and cur_p are not modified. Hence the assertion cannot be made
false.


Invariant 5.2.6 (DVS-IMPL) 


In any reachable state, if cur.id_p = ⊥ then act_p = v_0 and amb_p = ∅.


Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. Fix p. In the initial state we have that act_p = v_0 and amb_p = ∅.

For the inductive step assume the invariant is true in s. We need to prove that it is true in s'
for any possible step (s, π, s'). Fix p. We prove the invariant considering each possible action π.
Assume that s'.cur_p = ⊥. Since no action sets cur_p to ⊥ it must be s.cur_p = ⊥.


1. π = VS-GPRCV(⟨"info", v, V⟩)_{p,q}.
This cannot happen. Indeed by the precondition of π (see VS) we have that s.current-viewid[p] ≠ ⊥.
By Invariant 5.2.1 we have s.cur.id_p = s.VS.current-viewid[p]. Hence s'.cur.id_p = s.cur.id_p ≠
⊥. But we know that s'.cur.id_p = ⊥.

2. π = DVS-NEWVIEW(v)_p.
Cannot happen. Indeed the precondition of π says that v = s.cur_p. Since s.cur.id_p = ⊥, we
have v = ⊥. Thus the precondition v.id > client-cur.id_p cannot be satisfied (⊥ cannot be
strictly greater than any view identifier).

3. π = DVS-GARBAGE-COLLECT(v)_p.
Cannot happen. Indeed by Invariant 5.2.5 we have that s.cur_p ≠ ⊥. But we know that
s.cur_p = ⊥.

4. Other actions.
Variables cur_p, act_p and amb_p are not modified. Hence the assertion cannot be made false.

□


The following invariant states that if an "info" message is in transit for view v or has been
received by some process q in view v then there exists a process p that has sent the "info" in view
v and such that its current view is either v or a later one.


Invariant 5.2.7 (DVS-IMPL) 


In any reachable state, let C be the following condition:

⟨"info", x, X⟩ ∈ msgs-to-vs[g]_p or ⟨"info", x, X⟩ ∈ pending[p, g] or ⟨⟨"info", x, X⟩, p⟩ ∈
queue[g] or info-rcvd[p, g]_q = ⟨x, X⟩.

If C is true then info-sent[g]_p = ⟨x, X⟩ and cur.id_p ≥ g.


Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. Fix p, q, g, x and X. In the initial state msgs-to-vs[g]_p = λ,
pending[p, g] = λ, queue[g] = λ and info-rcvd[p, g]_q = ⊥. Hence, in the initial state, C is false and
the invariant is vacuously true.

For the inductive step assume that the invariant is true in a reachable state s. We need to prove
that it is true in state s' for any possible step (s, π, s') of the execution. Fix p, q, g, x, and X and
assume that C is true in s'.


1. π = VS-NEWVIEW(v)_p.
By the code of π, s'.cur_p = v. Assume v.id ≠ g. Then the code of π shows that none of
msgs-to-vs[g]_p, pending[p, g], queue[g] or info-rcvd[p, g]_q is changed during this step. Thus C
is true also in s. By the inductive hypothesis we have s.info-sent[g]_p = ⟨x, X⟩ and s.cur.id_p ≥ g.
Since we are considering the case v.id ≠ g, we have that info-sent[g]_p is not changed by π.
Moreover the precondition of π (see VS) shows that s'.current-viewid[p] > s.current-viewid[p].
By Invariant 5.2.1, cur.id_p = current-viewid[p], so s'.cur.id_p > s.cur.id_p. This completes
showing the conclusion for the situation v.id ≠ g.

Assume now v.id = g. The code shows s'.cur.id_p = g as required. It remains to show that
s'.info-sent[g]_p = ⟨x, X⟩.

Action π does not alter the values of pending[p, g], queue[g] and info-rcvd[p, g]_q and appends
⟨"info", s.act_p, s.amb_p⟩ to msgs-to-vs[g]_p. We claim that it must be x = s.act_p and
X = s.amb_p. Indeed if it is not so, then condition C is true also in state s (for the given
p, q, g, x, X) and by the inductive hypothesis we have s.cur.id_p ≥ g = v.id. By Invariant 5.2.1,
s.current-viewid[p] ≥ v.id. But this contradicts the precondition of π (see VS).

Thus x = s.act_p and X = s.amb_p. Then the code of π shows that s'.info-sent[g]_p = ⟨x, X⟩, as
required.


2. π = VS-GPRCV(⟨"info", v, V⟩)_{p,q}.
If g ≠ cur.id_q then since C is true in s' it is true also in s (for the given p, q, g, x, X). Thus
the inductive hypothesis applies. Since the code does not change info-sent[g]_p and cur.id_p, the
invariant follows from the inductive hypothesis.

Hence assume that g = cur.id_q. First consider the case x = v and X = V. In this case, by
the precondition of π (see VS) we have that ⟨⟨"info", x, X⟩, p⟩ ∈ queue[g]. Then the invariant
follows from the inductive hypothesis.

Consider now the case x ≠ v or X ≠ V. In this case, by the code, we have that s'.info-rcvd[p, g]_q ≠
⟨x, X⟩. Since C is true in s', it must be that ⟨"info", x, X⟩ ∈ msgs-to-vs[g]_p or ⟨"info", x, X⟩ ∈
pending[p, g] or ⟨⟨"info", x, X⟩, p⟩ ∈ queue[g] is true in s'. Variables msgs-to-vs[g]_p, pending[p, g]
and queue[g] are not changed by π. Hence C is true in s. The invariant follows from the
inductive hypothesis.


3. π = VS-GPSND(⟨"info", v, V⟩)_p.
If g ≠ client-cur.id_p then since C is true in s' it is true also in s (for the given p, q, g, x, X).
Thus the inductive hypothesis applies. Since the code does not change info-sent[g]_p and cur.id_p,
the invariant follows from the inductive hypothesis.

Hence assume that g = client-cur.id_p. First consider the case x = v and X = V. In this case,
by the precondition of π (see DVS-IMPL) we have that ⟨"info", x, X⟩ ∈ s.msgs-to-vs[g]_p. Then
the invariant follows from the inductive hypothesis.

Consider now the case x ≠ v or X ≠ V. Since C is true in s' we have that C is true in s too.
Indeed no ⟨"info", x, X⟩ message is deleted and info-rcvd[p, g]_q is not changed. The invariant
follows from the inductive hypothesis.

4. π = VS-ORDER(⟨"info", v, V⟩, p, g).
First consider the case x = v and X = V. In this case, by the precondition of π we have that
⟨"info", x, X⟩ ∈ s.pending[p, g]. Then the invariant follows from the inductive hypothesis.

Consider now the case x ≠ v or X ≠ V. Since C is true in s' we have that C is true in s too.
Indeed no ⟨"info", x, X⟩ message is deleted and info-rcvd[p, g]_q is not changed. The invariant
follows from the inductive hypothesis.

5. Other actions.
Condition C never changes from false to true and variables info-sent[g]_p and cur.id_p are not
modified. Hence the assertion cannot be made false.

□


The following invariant states that if a "registered" message for view v has been sent by process
p then variable reg[v.id]_p is set to true (that is, the view has been registered by the client at p).


Invariant 5.2.8 (DVS-IMPL) 


In any reachable state, let C be the following condition:

⟨"registered"⟩ ∈ msgs-to-vs[g]_p or ⟨"registered"⟩ ∈ pending[p, g] or ⟨⟨"registered"⟩, p⟩ ∈
queue[g] or rcvd-rgst[p, g]_q = true.

If C is true then reg[g]_p = true.

Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. Fix p, q, g. In the initial state we have that msgs-to-vs[g]_p = λ,
pending[p, g] = λ, queue[g] = λ and rcvd-rgst[p, g]_q = false. Hence C is false in the initial state
and the invariant is vacuously true.

For the inductive step assume the invariant is true in s. We need to prove that it is true in s'
for any possible step (s, π, s'). Fix p, g, q and assume that C is true in s'.


1. π = DVS-REGISTER_p.
If s.client-cur.id_p ≠ g then C is true also in s and the invariant follows from the inductive
hypothesis. Hence assume s.client-cur.id_p = g. By the code of π we have that
s'.reg[g]_p = true.

2. π = VS-GPSND(⟨"registered"⟩)_p.
If s.current-viewid[p] ≠ g then C is true also in s and the invariant follows from the inductive
hypothesis. Hence assume g = s.current-viewid[p]. By Invariant 5.2.1 we have that s.cur.id_p =
s.current-viewid[p]. Hence s.cur.id_p = g. By the precondition of π (see DVS-IMPL) we have
that ⟨"registered"⟩ ∈ s.msgs-to-vs[g]_p. Hence C is true in s and the invariant follows from the
inductive hypothesis.

3. π = VS-ORDER(⟨"registered"⟩, p', g').
If p' ≠ p or g' ≠ g then C is true also in s and the invariant follows from the inductive
hypothesis. Hence assume p' = p and g' = g. By the precondition of π we have that
⟨"registered"⟩ ∈ s.pending[p, g]. Hence C is true also in s and the invariant follows from
the inductive hypothesis.

4. Other actions.
Condition C never changes from false to true and variable reg[g]_p is not modified. Hence the
assertion cannot be made false.


The following invariant states some facts about views in TotReg.


Invariant 5.2.9 (DVS-IMPL) 


In any reachable state:
1. act_p ∈ TotReg.
2. If info-sent[g]_p = ⟨x, X⟩ then x ∈ TotReg.
3. use_p ∩ TotReg ≠ ∅.

Proof: First notice that Part 3 follows easily from Part 1 and the fact that, by definition, act_p ∈ use_p.
Hence we only need to prove Parts 1 and 2.

By induction on the length of the execution. The base case consists of proving that the invariant
is true in the initial state. For Part 1, fix p. In the initial state act_p = v_0 and v_0 is totally registered
by definition. For Part 2, fix p, g. In the initial state info-sent[g]_p = ⊥. Hence the invariant is
vacuously true.

For the inductive step assume the invariant is true in s. We need to prove that it is true in s' for
any possible step (s, π, s'). Fix p, g, x and X. We prove the invariant by considering each possible
action.


1. π = VS-NEWVIEW(v)_p.
Part 1 is still true in s' because act_p is not modified (as well as TotReg).

Consider Part 2 now. Assume that s'.info-sent[g]_p = ⟨x, X⟩. If v.id ≠ g then s.info-sent[g]_p =
⟨x, X⟩, and by the inductive hypothesis we have that x ∈ s.TotReg. Since no view is ever
removed from TotReg we have that x ∈ s'.TotReg, as needed. Hence we can further assume
that v.id = g. Since s'.info-sent[g]_p = ⟨x, X⟩ and action π sets info-sent[g]_p = ⟨act_p, amb_p⟩ it
must be that s.act_p = x and s.amb_p = X.

By the inductive hypothesis, Part 1, we have that s.act_p ∈ s.TotReg. But x = s.act_p and no
view is removed from TotReg. Hence x ∈ s'.TotReg. Thus Part 2 is still true in s'.


2. π = VS-GPRCV(⟨"info", v, V⟩)_{p,q}.
Consider Part 1 first. If s'.act_p = s.act_p then Part 1 follows by the inductive hypothesis. Hence
assume that s'.act_p ≠ s.act_p. By the code we have that s'.act_p = v. Thus we have to prove
that v ∈ TotReg. By the precondition of π (in VS) we have ⟨⟨"info", v, V⟩, q⟩ ∈ s.queue[cur.id_p].
Then Invariant 5.2.7 implies that s.info-sent[cur.id_p]_q = ⟨v, V⟩. By the inductive hypothesis,
Part 2, we have that v ∈ s.TotReg, as needed.

Part 2 is preserved because info-sent[g]_p is not modified.


3. π = DVS-GARBAGE-COLLECT(v)_p.
Consider Part 1 first. If s'.act_p = s.act_p then Part 1 follows by the inductive hypothesis.
Hence assume that s'.act_p ≠ s.act_p. By the code we have that s'.act_p = v. Hence we have to
prove that v ∈ TotReg. By the precondition of π we have that rcvd-rgst[q, v.id]_p = true for all
q ∈ v.set. Then Invariant 5.2.8 implies that v ∈ TotReg.

Part 2 is preserved because info-sent[g]_p is not modified.

4. Other actions.
Variables act_p, info-sent[g]_p (as well as TotReg) are not modified. Hence the assertions cannot
be made false.

□


The following invariant states that if process q is in a view v which has been attempted by process p
(which may or may not be q itself) then the current view of q is either v or a later one.


Invariant 5.2.10 (DVS-IMPL) 


In any reachable state, if v ∈ attempted_p and q ∈ v.set then cur.id_q ≥ v.id.


Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. Fix p, v and suppose that v ∈ attempted_p and q ∈ v.set. If
p ∉ P_0 then attempted_p = ∅, a contradiction. On the other hand, if p ∈ P_0 then since v ∈ attempted_p
it must be that v = v_0. Moreover since q ∈ v.set we have that q ∈ P_0. Hence cur_q = v_0, so
cur.id_q ≥ v.id, as needed.

For the inductive step assume the invariant is true in state s. We need to prove that it is true in
s' for any possible step (s, π, s'). Fix p and v and assume that v ∈ s'.attempted_p and q ∈ v.set. We
distinguish two cases.


1. v ∈ s.attempted_p.
By the inductive hypothesis we have that s.cur.id_q ≥ v.id. By the monotonicity of cur.id we
have that s'.cur.id_q ≥ s.cur.id_q.

2. v ∉ s.attempted_p.
It must be π = DVS-NEWVIEW(v)_p. We consider two possible cases: q = p and q ≠ p.

Assume that q = p. Then Invariant 5.2.2 implies that s'.client-cur_p ≥ v.id. Since s'.cur.id_p =
s'.client-cur_p, we have that s'.cur.id_p ≥ v.id, as needed.

Assume that q ≠ p. Then the precondition of π says that s.info-rcvd[q, v.id]_p ≠ ⊥. By Invariant
5.2.7 (used with p and q interchanged) we have that cur.id_q ≥ v.id, as needed.


The following invariant states properties of views in the use set. 


Invariant 5.2.11 (DVS-IMPL)

In any reachable state:
1. If cur_p ≠ ⊥ and w ∈ use_p, then w.id ≤ cur.id_p.
2. If cur_p ≠ ⊥ and client-cur_p ≠ cur_p and w ∈ use_p, then w.id < cur.id_p.
3. If info-sent[g]_p = ⟨x, X⟩ and w ∈ {x} ∪ X then w.id < g.


Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. Consider Part 1 first. In the initial state we have that use_p is
either empty or contains only v_0. In the former case Part 1 is vacuously true. In the latter case we
have that w = v_0 and the invariant follows from the fact that g_0 is the minimum element of G. Parts
2 and 3 are vacuously true. Indeed in the initial state client-cur_p = cur_p and info-sent[g]_p = ⊥.

For the inductive step assume the invariant is true in state s. We need to prove that it is true in
s' for any possible step (s, π, s'). Fix p, g, x, X and w.
We prove that the invariant is still true in s' by considering each possible action π.


1. π = VS-NEWVIEW(v)_p.
First consider Part 1. Assume that s'.cur_p ≠ ⊥ and w ∈ s'.use_p. Then w ∈ s.use_p. If
s.cur_p = ⊥, then, by Invariant 5.2.6, w = v_0. Since v_0.id is the minimum element of G,
we have that w.id < s'.cur.id_p. So assume that s.cur_p ≠ ⊥. In this case, by the inductive
hypothesis, Part 1, we have that w.id ≤ s.cur.id_p, which implies w.id < s'.cur.id_p.

Hence Part 1 is still true in s'. Since we actually proved that w.id < s'.cur.id_p, also Part 2 is
still true in s'.

Now consider Part 3. Assume that s'.info-sent[g]_p = ⟨x, X⟩ and w ∈ {x} ∪ X. If g ≠ v.id then
we have that s.info-sent[g]_p = ⟨x, X⟩. By the inductive hypothesis, Part 3, we have w.id < g,
as needed. Hence assume g = v.id. By the code of π, we have that s.use_p = {x} ∪ X. Now if
s.cur_p = ⊥, then by Invariant 5.2.6, w = v_0. Since v_0.id is the minimum element of G, we have
that w.id < v.id = g, as needed. So assume further that s.cur_p ≠ ⊥. In this case, the inductive
hypothesis, Part 1, implies that w.id ≤ s.cur.id_p, which implies w.id < s'.cur.id_p = v.id = g,
as needed.


2. π = DVS-NEWVIEW(v)_p.
Consider Part 1 first. The only possible new element added to use_p is v. Since v.id = s'.cur.id_p,
Part 1 still holds in s'. Part 2 is vacuously true, because s'.client-cur_p = s'.cur_p. Part 3 is
preserved because info-sent[g]_p is not modified.

3. π = DVS-GARBAGE-COLLECT(v)_p.
Consider Part 1. Assume that s'.cur_p ≠ ⊥ and that w ∈ s'.use_p. By the code s'.cur_p = s.cur_p.
If w ∈ s.use_p then by the inductive hypothesis Part 1 is true in s and thus it is still true in s'.
Hence assume that w ∉ s.use_p. By the code, this cannot happen because no view is added to
use_p.

Part 2 can be proved in a similar way. Part 3 is preserved because info-sent[g]_p is not modified.

4. π = VS-GPRCV(⟨"info", x, X⟩)_{q,p}.
The proof is exactly as in the previous case.

5. Other actions.
Variables use_p, cur_p, client-cur_p and info-sent[g]_p are not modified. Hence none of the
assertions can be made false.

□


The following three invariants say that certain views appear in use sets, or in "info" messages,
unless they have been garbage-collected.


Invariant 5.2.12 (DVS-IMPL)

In any reachable state, if w ∈ attempted_p then either w ∈ use_p or w.id < act.id_p.


Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. Fix p, w and suppose that w ∈ attempted_p. If p ∉ P_0 then
attempted_p = ∅, a contradiction. On the other hand, if p ∈ P_0 then since w ∈ attempted_p it must
be that w = v_0. But in this case also act_p = v_0, so v_0 ∈ use_p, as needed.

For the inductive step assume the invariant is true in state s. We need to prove that it is true in
s' for any possible step (s, π, s'). So fix w and p such that w ∈ s'.attempted_p. We distinguish two
possible cases.


1. w ∈ s.attempted_p.
By the inductive hypothesis we have that either w ∈ s.use_p or w.id < s.act.id_p. In the
latter case, because of the monotonicity of act.id_p, we have w.id < s'.act.id_p. So assume that
w ∈ s.use_p. If w ∈ s'.use_p we are done, so assume further that w ∉ s'.use_p. Then it must be
that either π = DVS-GARBAGE-COLLECT(v)_p or π = VS-GPRCV(⟨"info", x, X⟩)_{r,p} for some r. In either
case, the code implies that s'.act.id_p > w.id.

2. w ∉ s.attempted_p.
It must be π = DVS-NEWVIEW(v)_p. By the code, view v is inserted into attempted_p but also into
amb_p (and hence into use_p). Thus the invariant is still true in s'.


Invariant 5.2.13 (DVS-IMPL) 
In any reachable state, if info-rcvd[q, g]_p = ⟨x, X⟩ and w ∈ {x} ∪ X, then either w ∈ use_p or
w.id < act.id_p.

Proof: By induction on the length of an execution. The base case consists of proving that the
invariant is true in the initial state. In the initial state info-rcvd[q, g]_p = ⊥ for any p, q, g. Hence
the statement is vacuously true.

For the inductive step assume the invariant is true in state s. We need to prove that it is true in s'
for any possible step (s, π, s'). Fix p, q, g, x, X and w, and assume that s'.info-rcvd[q, g]_p = ⟨x, X⟩,
and w ∈ {x} ∪ X. We consider two cases:
1. s.info-rcvd[q, g]_p = ⟨x, X⟩.
By the statement applied to s, we obtain that either w ∈ s.use_p, or s.act.id_p > w.id. In the
latter case, s'.act.id_p > w.id, because of monotonicity of act.id_p. So assume that w ∈ s.use_p.
If w ∈ s'.use_p then we are done, so assume further that w ∉ s'.use_p. (That is, w is
garbage-collected.)
Then it must be that either π = DVS-GARBAGE-COLLECT(v)_p or π = VS-GPRCV(⟨"info", x, X⟩)_{r,p} for
some r. In either case, the code implies that s'.act.id_p > w.id.

2. s.info-rcvd[q, g]_p ≠ ⟨x, X⟩.
Then π = VS-GPRCV(⟨"info", x, X⟩)_{q,p}. If w ∈ s'.use_p then we are done. Hence assume that
w ∉ s'.use_p. By the code, we have that s'.act.id_p > w.id (that is, w is garbage-collected).


Invariant 5.2.14 (DVS-IMPL) 
In any reachable state, if info-sent[g]_p = ⟨x, X⟩, w ∈ attempted_p, and w.id < g, then either w ∈
{x} ∪ X or w.id < x.id.


Proof: By induction on the length of an execution. The base case consists of proving that the
invariant is true in the initial state. In the initial state, info-sent[g]_p = ⊥ for all g, p, so the
statement is vacuously true.

For the inductive step assume the invariant is true in state s. We need to prove that it is true in
s' for any possible step (s, π, s'). Fix p, g, w, x, and X, and assume that s'.info-sent[g]_p = ⟨x, X⟩,
w ∈ s'.attempted_p, and w.id < g. We consider four cases:
1. s.info-sent[g]_p = ⟨x, X⟩ and w ∈ s.attempted_p.
Then the statement for s implies that either w ∈ {x} ∪ X or w.id < x.id. In either case the
statement is true in s' also.

2. s.info-sent[g]_p ≠ ⟨x, X⟩ and w ∉ s.attempted_p.
This cannot happen because both conditions cannot become true in a single step: the first only
becomes true by means of a VS-NEWVIEW(v)_p, for some view v, while the second only becomes
true by means of DVS-NEWVIEW(w)_p.

3. s.info-sent[g]_p ≠ ⟨x, X⟩ and w ∈ s.attempted_p.
It must be π = VS-NEWVIEW(v)_p, for some v, x must be s.act_p, and X must be s.amb_p.
Invariant 5.2.12 implies that either w ∈ s.use_p or w.id < s.act.id_p. Now, s.use_p = {s.act_p} ∪
s.amb_p = {x} ∪ X. So we have that either w ∈ {x} ∪ X or w.id < x.id, as needed.

4. s.info-sent[g]_p = ⟨x, X⟩ and w ∉ s.attempted_p.
Then π must be DVS-NEWVIEW(w)_p. We claim that this cannot happen: Since s.info-sent[g]_p =
⟨x, X⟩, by Invariant 5.2.3 we have s.cur.id_p ≥ g. Since g > w.id, we have s.cur.id_p > w.id. But
the precondition of π requires that s.cur.id_p = w.id. Hence π is not enabled in state s.

Invariant 5.2.15 says that two attempted views having no intervening totally registered view,
and having a common member, q, that has attempted the first view, must intersect in a majority
of processors. This is because, under these circumstances, information must flow from q to any
processor that attempts the second view.


Invariant 5.2.15 (DVS-IMPL) 


In any reachable state, suppose that v ∈ attempted_p, q ∈ v.set, w ∈ attempted_q, w.id < v.id, and
there is no x ∈ TotReg such that w.id < x.id < v.id. Then |v.set ∩ w.set| > |w.set|/2.


Proof: By induction on the length of an execution. The base case consists of proving that the
invariant is true in the initial state. In the initial state, only v_0 is attempted, so the hypotheses
cannot be satisfied. Thus, the statement is vacuously true.

For the inductive step assume the invariant is true in state s. We need to prove that it is true in
s' for any possible step (s, π, s'). Fix v, w, p, and q, and assume that v ∈ s'.attempted_p, q ∈ v.set,
w ∈ s'.attempted_q, w.id < v.id, and there is no x ∈ s'.TotReg such that w.id < x.id < v.id. Then
also there is no x ∈ s.TotReg such that w.id < x.id < v.id. We consider four cases:
1. v ∈ s.attempted_p and w ∈ s.attempted_q.
Then the statement for s implies that |v.set ∩ w.set| > |w.set|/2, as needed.

2. v ∉ s.attempted_p and w ∉ s.attempted_q.
This cannot happen because we cannot have both v and w becoming attempted in a single
step.

3. v ∉ s.attempted_p and w ∈ s.attempted_q.
Then π must be DVS-NEWVIEW(v)_p. Since q ∈ v.set, by the precondition of π we have that
s.info-rcvd[q, v.id]_p = ⟨x, X⟩ for some x and X. Then Invariant 5.2.7 implies that s.info-sent[v.id]_q =
⟨x, X⟩. Then (since w.id < v.id), Invariant 5.2.14 implies that either w ∈ {x} ∪ X or
w.id < x.id. If w.id < x.id, then we obtain a contradiction. Indeed by Invariant 5.2.9
x ∈ s.TotReg and by Invariant 5.2.11, Part 3 (used with w = x) we have x.id < v.id. This
contradicts the hypothesis. So w ∈ {x} ∪ X.

Now by Invariant 5.2.13 we have that either w ∈ s.use_p or w.id < s.act.id_p. In the former
case, by the precondition of π, we have |v.set ∩ w.set| > |w.set|/2. In the latter case, we obtain
a contradiction. Indeed by Invariant 5.2.9 we have s.act_p ∈ TotReg. Moreover by the
precondition of π, s.cur_p cannot be ⊥ and s.cur_p > s.client-cur_p and, by definition, s.act_p ∈ s.use_p.
Hence by Invariant 5.2.11, Part 2, we have s.act.id_p < s.cur.id_p = v.id. Thus we would have
a totally registered view act_p such that w.id < act.id_p < v.id. This contradicts the
hypothesis.

4. v ∈ s.attempted_p and w ∉ s.attempted_q.
Then π must be DVS-NEWVIEW(w)_q. But this cannot happen. Indeed since v ∈ s.attempted_p
and q ∈ v.set, Invariant 5.2.10 implies that s.cur.id_q ≥ v.id. Since v.id > w.id, we have
s.cur.id_q > w.id. But the precondition of action π requires s.cur.id_q = w.id, so π is not
enabled in s.

Invariant 5.2.16 says that any attempted view v intersects the latest preceding totally registered
view w in a majority of members of w.


Invariant 5.2.16 (DVS-IMPL) 
In any reachable state, suppose that v ∈ Att and w ∈ TotReg, w.id < v.id, and there is no x ∈ TotReg
such that w.id < x.id < v.id. Then |v.set ∩ w.set| > |w.set|/2.
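The statement of Invariant 5.2.16 can be read as a predicate on a pair of view collections. The sketch below evaluates that predicate on explicitly given data. The `View` type, the field names `vid` and `members`, and the function names are hypothetical encodings chosen for illustration, and the code assumes that distinct views have distinct identifiers.

```python
from typing import FrozenSet, Iterable, NamedTuple


class View(NamedTuple):
    vid: int                 # stands for v.id
    members: FrozenSet[str]  # stands for v.set


def majority_overlap(v: View, w: View) -> bool:
    """|v.set ∩ w.set| > |w.set| / 2, written without division."""
    return 2 * len(v.members & w.members) > len(w.members)


def holds_5_2_16(att: Iterable[View], tot_reg: Iterable[View]) -> bool:
    """Check the statement of Invariant 5.2.16 on the given Att and TotReg."""
    ordered = sorted(tot_reg, key=lambda x: x.vid)
    for v in att:
        earlier = [w for w in ordered if w.vid < v.vid]
        if not earlier:
            continue  # no totally registered view precedes v
        # The unique w in TotReg with no x in TotReg strictly between w and v.
        w = earlier[-1]
        if not majority_overlap(v, w):
            return False
    return True
```

For instance, with v0 = {a, b, c}, v1 = {a, b, d} and v2 = {d, e} (identifiers 0, 1, 2), the predicate holds for Att = {v0, v1} and TotReg = {v0}, and fails once v2 is attempted while v1 is the latest totally registered view, because v2 shares only d with v1.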


Proof: By induction on the length of an execution. The base case consists of proving that the
invariant is true in the initial state. In the initial state, only v_0 is attempted, so the hypotheses
cannot be satisfied. Thus, the statement is vacuously true.

For the inductive step assume the invariant is true in state s. We need to prove that it is true in s'
for any possible step (s, π, s'). Fix v and w, and assume that v ∈ s'.Att, w ∈ s'.TotReg, w.id < v.id,
and there is no x ∈ s'.TotReg such that w.id < x.id < v.id. We consider four cases:
1. v ∈ s.Att and w ∈ s.TotReg.
Then, from the inductive hypothesis we have |v.set ∩ w.set| > |w.set|/2.

2. v ∉ s.Att and w ∉ s.TotReg.
This cannot happen because we cannot have both v becoming attempted and w becoming
totally registered in a single step.

3. v ∉ s.Att and w ∈ s.TotReg.
Then π must be DVS-NEWVIEW(v)_p for some p. The precondition of π implies that, for any view
y ∈ s.use_p, |v.set ∩ y.set| > |y.set|/2. Hence to prove the claim it is enough to prove that
w ∈ s.use_p. We proceed by contradiction assuming that w ∉ s.use_p.

By Invariant 5.2.9, Part 3, s.use_p ∩ s.TotReg ≠ ∅. Let m be the view in s.use_p ∩ s.TotReg
having the biggest identifier. We know that m ≠ w because w ∉ s.use_p. Also, m ≠ v, because
m ∈ s.TotReg and v ∉ s.TotReg. It follows that m.id ≠ v.id.

We claim that m.id < w.id. We have already shown that m.id ≠ w.id. Suppose for the sake
of contradiction that m.id > w.id. From the precondition of action π we have that s.cur_p = v
and hence s.cur_p ≠ ⊥. Also from the precondition of π we have that s.client-cur_p < s.cur_p.
Since m ∈ s.use_p, Invariant 5.2.11, Part 2, implies that m.id < s.cur.id_p and since s.cur_p = v
we have m.id < v.id. So w.id < m.id < v.id. Since m ∈ s'.TotReg, this contradicts
the hypothesis of the inductive step. Therefore, m.id < w.id.

Let n be the view in s.TotReg that has the smallest id strictly greater than that of m. Remember
that w ∈ s'.TotReg and since π = DVS-NEWVIEW(v)_p we have that w ∈ s.TotReg; thus n exists and it
holds m.id < n.id ≤ w.id < v.id. Since m ∈ s.use_p, the precondition of π implies that |v.set ∩
m.set| > |m.set|/2. By the statement applied to state s, |n.set ∩ m.set| > |m.set|/2. Hence
there exists a processor q ∈ v.set ∩ n.set. By the precondition of π, s.info-rcvd[q, v.id]_p = ⟨x, X⟩
for some x, X. Then Invariant 5.2.7 implies that s.info-sent[v.id]_q = ⟨x, X⟩. Then Invariant
5.2.11, Part 3 (used with w = x), implies that x.id < v.id. Since n ∈ s.TotReg, we have that
n ∈ s.attempted_q. Then Invariant 5.2.14 (used with w = n) implies that either n ∈ {x} ∪ X
or n.id < x.id. In either case, {x} ∪ X contains a view y ∈ s.TotReg (either n or x) such that
n.id ≤ y.id < v.id. Then Invariant 5.2.13 implies that either y ∈ s.use_p or y.id < s.act.id_p. By
Invariant 5.2.9, Part 1, s.act_p ∈ s.TotReg and by definition, s.act_p ∈ s.use_p. So in either case,
the hypothesis that m is the totally registered view with the largest id belonging to s.use_p is
contradicted.


4. v ∈ s.Att and w ∉ s.TotReg.
Then π must be DVS-REGISTER_p for some p. Let m be the view in s.TotReg with the largest id
that is strictly less than w.id. By the statement for s, we know that |w.set ∩ m.set| > |m.set|/2
and |v.set ∩ m.set| > |m.set|/2. Hence there is a processor q ∈ w.set ∩ v.set.

Since v ∈ s.Att, there exists a processor r such that v ∈ s.attempted_r. Thus also v ∈
s'.attempted_r. Since w ∈ s'.TotReg, we have that w ∈ s'.attempted_q. By assumption, there
is no view x ∈ s'.TotReg such that w.id < x.id < v.id. By Invariant 5.2.15 applied to state s'
(with p = r), we have that |v.set ∩ w.set| > |w.set|/2, as needed.

□


The final invariant, a corollary to Invariant 5.2.16, is instrumental in proving that DVS-IMPL
implements DVS.


Invariant 5.2.17 (DVS-IMPL) 
In any reachable state, if v, w ∈ Att, w.id < v.id, and there is no x ∈ TotReg with w.id < x.id < v.id,
then v.set ∩ w.set ≠ ∅.


Proof: Suppose that v and w are as given. We consider two cases.

1. w ∈ TotReg.
Since there is no x ∈ TotReg with w.id < x.id < v.id, Invariant 5.2.16 implies that
|v.set ∩ w.set| > |w.set|/2, which implies that v.set ∩ w.set ≠ {}, as needed.

2. w ∉ TotReg.
Then let Y = {y | y ∈ TotReg, y.id < w.id}. We first show that Y is nonempty: Invariant 5.2.4
implies that v_0 ∈ TotReg and that v_0.id ≤ w.id. If v_0.id = w.id, then by Invariant 4.1.1, we
have w = v_0. But then w ∈ TotReg, a contradiction to the definition of this case. So we must
have v_0.id < w.id, which implies that v_0 ∈ Y, so Y is nonempty.

Now fix z to be the view in Y with the largest id. We have that there is no x ∈ TotReg
with z.id < x.id < v.id. Then Invariant 5.2.16 implies that |w.set ∩ z.set| > |z.set|/2 and
|v.set ∩ z.set| > |z.set|/2. Together, these two facts imply that v.set ∩ w.set ≠ {}, as needed.

□
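The last step of Case 2 rests on a counting fact: two sets that each contain a strict majority of the members of a third set z must share at least one member of z. The following is a small brute-force sketch of that fact over every triple of subsets of a four-element universe. The function names are hypothetical and the check is only an illustration on a bounded universe, not a replacement for the argument.

```python
from itertools import combinations


def subsets(universe):
    """All subsets of the given universe, as frozensets."""
    items = list(universe)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]


def majority_of(a, z):
    """|a ∩ z| > |z| / 2, written without division."""
    return 2 * len(a & z) > len(z)


def counterexamples(universe):
    """Triples (v, w, z) where v and w both hold a majority of z yet are disjoint."""
    all_sets = subsets(universe)
    return [(v, w, z)
            for z in all_sets
            for v in all_sets
            for w in all_sets
            if majority_of(v, z) and majority_of(w, z) and not (v & w)]
```

Evaluating `counterexamples(range(4))` yields an empty list, as the counting argument predicts: if v and w were disjoint, then |v ∩ z| + |w ∩ z| would exceed |z|.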


5.2.3 Proof that DVS-IMPL implements DVS

We prove that DVS-IMPL implements DVS by defining a function F_dvs that maps states of DVS-IMPL
to states of DVS and proving that this function is an abstraction function. Section 5.2.3 contains the
definition of the function F_dvs and some auxiliary lemmas while Section 5.2.3 contains the proof
that F_dvs is an abstraction function.


The abstraction function for DVS-IMPL. 


DVS-IMPL uses Vs to send client messages and messages generated by the implementation ( “info” 
and “registered” messages). The abstraction function discards the non-client messages. Thus, if q 
is a finite sequence of client and non-client messages, we define purge(q) to be the queue obtained 
by deleting any “info” or “registered” messages from q, and purgesize(q) to be the number of “info” 
and “registered” messages in qg. Figure 5-4 defines the abstraction function Fgys. 

Next we give some simple consequences of the definition of $F_{dvs}$. They deal with the messages delivered by DVS-IMPL. They state that these messages are exactly the ones that DVS would deliver to the client.

Let $s$ be a state of DVS-IMPL. The state $t = F_{dvs}(s)$ of DVS is the following.

• $t.created = \bigcup_{p \in P} s.attempted_p$
• for each $p \in P$, $t.\text{current-viewid}[p] = s.\text{client-cur}.id_p$
• for each $g \in G$, $t.attempted[g] = \{p \mid g = v.id,\ v \in s.attempted_p\}$
• for each $g \in G$, $t.registered[g] = \{p \mid s.reg[g]_p\}$
• for each $p \in P$, $g \in G$, $t.pending[p,g] = purge(s.pending[p,g]) \circ purge(s.\text{msgs-to-vs}[g]_p)$
• for each $g \in G$, $t.queue[g] = purge(s.queue[g])$
• for each $p \in P$, $g \in G$, $t.next[p,g] = s.next[p,g] - purgesize(s.queue[g](1..next[p,g]-1)) - |s.\text{msgs-from-vs}[g]_p|$
• for each $p \in P$, $g \in G$, $t.\text{next-safe}[p,g] = s.\text{next-safe}[p,g] - purgesize(s.queue[g](1..\text{next-safe}[p,g]-1)) - |s.\text{safe-from-vs}[g]_p|$

Figure 5-4: The abstraction function $F_{dvs}$.
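The role of purge and purgesize in the definition of the abstract pointer t.next[p, g] can be illustrated with a small sketch. The message encoding below (a pair whose first component is one of the tags "client", "info" or "registered") and the function name abstract_next are assumptions made only for this illustration; they do not appear in the DVS-IMPL code.

```python
from typing import List, Tuple

# Hypothetical encoding: a message is a (kind, payload) pair, where kind is
# "client", "info" or "registered". These tags are assumptions of this sketch.
Message = Tuple[str, object]
NON_CLIENT = {"info", "registered"}


def purge(q: List[Message]) -> List[Message]:
    """Drop every "info" and "registered" message, keeping client messages in order."""
    return [m for m in q if m[0] not in NON_CLIENT]


def purgesize(q: List[Message]) -> int:
    """Count the "info" and "registered" messages in q."""
    return sum(1 for m in q if m[0] in NON_CLIENT)


def abstract_next(queue: List[Message], nxt: int, msgs_from_vs: List[Message]) -> int:
    """t.next[p, g] for one pair (p, g): the 1-based pointer nxt, minus the
    non-client messages in queue(1..nxt-1), minus the messages that VS has
    delivered but that are still buffered in msgs-from-vs."""
    return nxt - purgesize(queue[: nxt - 1]) - len(msgs_from_vs)


queue = [("client", "a"), ("info", None), ("client", "b")]
# VS has delivered "a", which is still buffered for the client; next points at "info".
before = abstract_next(queue, 2, [("client", "a")])
# VS delivers the "info" message: next advances, but nothing is buffered for the client.
after = abstract_next(queue, 3, [("client", "a")])
```

In this example the concrete pointer moves from 2 to 3 while the abstract pointer stays the same, which is the cancellation used repeatedly in the simulation proof.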


Invariant 5.2.18 (DVS-IMPL)
In any reachable state $s$, if $s.\text{msgs-from-vs}[g]_p = ((m_1, q_1), (m_2, q_2), \ldots, (m_k, q_k))$, then we have that

$F_{dvs}(s).queue[g](next[p,g]..next[p,g]+k-1) = ((m_1, q_1), (m_2, q_2), \ldots, (m_k, q_k))$.


Proof: By induction on the length of the execution. The base case consists of proving that the invariant is true in the initial state. In the initial state no message is in $\text{msgs-from-vs}[g]_p$. Hence the invariant is vacuously true.

For the inductive step, assume that the invariant is true in state $s$. We need to prove that it is true in state $s'$ for any possible step $(s, \pi, s')$. Fix $p$, $g$ and $m_1, q_1, m_2, q_2, \ldots, m_k, q_k$ and assume that $s'.\text{msgs-from-vs}[g]_p = ((m_1, q_1), (m_2, q_2), \ldots, (m_k, q_k))$. We distinguish the following cases.


1. $s.\text{msgs-from-vs}[g]_p = ((m_1, q_1), \ldots, (m_{k-1}, q_{k-1}))$.
It must be that $\pi = \text{VS-GPRCV}(m_k)_{q_k,p}$. By the inductive hypothesis we have that

$F_{dvs}(s).queue[g](next[p,g]..next[p,g]+k-2) = ((m_1, q_1), \ldots, (m_{k-1}, q_{k-1}))$.

By the code in VS we have that $next[p,g]$ is increased by one and by the code in DVS we have that the size of $\text{msgs-from-vs}[g]_p$ also increases by one. Hence by the definition of $F_{dvs}$, we have that $F_{dvs}(s').next[p,g] = F_{dvs}(s).next[p,g]$. Moreover $F_{dvs}(s').queue[g] = F_{dvs}(s).queue[g]$ and by the precondition of $\pi$ we have that $F_{dvs}(s).queue[g](s.next[p,g]+k-1) = (m_k, q_k)$. Thus the invariant is still true in $s'$.


2. $s.\text{msgs-from-vs}[g]_p = ((m, q), (m_1, q_1), (m_2, q_2), \ldots, (m_k, q_k))$.
Then $\pi = \text{DVS-GPRCV}(m)_{q,p}$. By the inductive hypothesis we have that

$F_{dvs}(s).queue[g](next[p,g]..next[p,g]+k) = ((m, q), (m_1, q_1), (m_2, q_2), \ldots, (m_k, q_k))$.

By the code we have that $next[p,g]$ is incremented by one. Since $F_{dvs}(s').queue[g] = F_{dvs}(s).queue[g]$, the invariant is still true in $s'$.


3. $s.\text{msgs-from-vs}[g]_p = s'.\text{msgs-from-vs}[g]_p$.
By the inductive hypothesis the assertion is true in state $s$. For any possible action in this case $F_{dvs}(s').next[p,g] = F_{dvs}(s).next[p,g]$ and the portion of $F_{dvs}(s).queue[g]$ involved in the statement of the invariant never changes because messages are only appended to $queue[g]$. Thus the assertion cannot be made false.

4. Other cases.
Not possible. Indeed $\text{msgs-from-vs}[g]_p$ either stays the same or is changed by appending a message or deleting the head. □


The following invariant follows easily from the previous one. It just states that the next message 


delivered by DVS-IMPL to a processor p is the same one that Dvs delivers. 


Invariant 5.2.19 (DVS-IMPL)
In any reachable state $s$, if $(m, q)$ is head of $s.\text{msgs-from-vs}[g]_p$, then $F_{dvs}(s).queue[g](next[p,g]) = (m, q)$.

Proof: Follows easily from Invariant 5.2.18. □


Similar invariants hold for the delivery of safe messages. 


Invariant 5.2.20 (DVS-IMPL)
In any reachable state $s$, we have that if $s.\text{safe-from-vs}[g]_p = ((m_1, q_1), (m_2, q_2), \ldots, (m_k, q_k))$, then

$F_{dvs}(s).queue[g](\text{next-safe}[p,g]..\text{next-safe}[p,g]+k-1) = ((m_1, q_1), (m_2, q_2), \ldots, (m_k, q_k))$.

Proof: The proof is as for msgs-from-vs (Invariant 5.2.18), except that it uses the safe-from-vs queue instead of msgs-from-vs and the pointer next-safe instead of next. □


Invariant 5.2.21 (DVS-IMPL)
In any reachable state $s$, if $(m, q)$ is head of $s.\text{safe-from-vs}[g]_p$, then $F_{dvs}(s).queue[g](\text{next-safe}[p,g]) = (m, q)$.

Proof: Follows easily from Invariant 5.2.20. □


Notice that $v$ is totally registered in state $s$ of DVS-IMPL if and only if it is totally registered in the state $F_{dvs}(s)$ of DVS.

Proof that $F_{dvs}$ is an abstraction function.


In order to prove that $F_{dvs}$ is an abstraction function we need to prove that for any initial state $s$ of DVS-IMPL we have that $F_{dvs}(s)$ is an initial state of DVS and that for any possible step $\pi$ of DVS-IMPL there exists a sequence $\alpha$ of steps of DVS such that the trace of $\alpha$, that is the externally observable behavior, is equal to the trace of $\pi$. Lemmas 5.2.22 and 5.2.23 prove the above.

Lemma 5.2.22 If $s$ is an initial state of DVS-IMPL then $F_{dvs}(s)$ is an initial state of DVS.


Proof: Let $s_0$ be the unique initial state of DVS-IMPL and $t_0$ the unique initial state of DVS.

We have $s_0.attempted_p = \{v_0\}$ for $p \in P_0$ and $s_0.attempted_p = \emptyset$ for $p \notin P_0$. By the definition of $F_{dvs}$ and the fact that $P_0 \neq \emptyset$ (because all membership sets are defined to be nonempty), we have $F_{dvs}(s_0).created = \{v_0\}$. This is as in $t_0$.

We have $s_0.\text{client-cur}_p = v_0$ for $p \in P_0$ and $s_0.\text{client-cur}_p = \bot$ for $p \notin P_0$. By the definition of $F_{dvs}$ we have $F_{dvs}(s_0).\text{current-viewid}[p] = g_0$ for $p \in P_0$ and $F_{dvs}(s_0).\text{current-viewid}[p] = \bot$ for $p \notin P_0$. This is as in $t_0$.

We have $s_0.attempted_p = \{v_0\}$ for $p \in P_0$ and $s_0.attempted_p = \emptyset$ for $p \notin P_0$. By the definition of $F_{dvs}$ we have $F_{dvs}(s_0).attempted[g_0] = P_0$ and $F_{dvs}(s_0).attempted[g] = \emptyset$ for $g \neq g_0$. This is as in $t_0$.

Let $g \in G$. We have that $s_0.reg[g]_p$ is true if and only if $p \in P_0$ and $g = g_0$. By the definition of $F_{dvs}$ we have $F_{dvs}(s_0).registered[g_0] = P_0$ and $F_{dvs}(s_0).registered[g] = \emptyset$ for $g \neq g_0$, as in $t_0$.

Let $p \in P$. We have that $s_0.\text{msgs-to-vs}[g]_p = \lambda$ and $s_0.pending[p,g] = \lambda$. By the definition of $F_{dvs}$ we have $F_{dvs}(s_0).pending[p,g] = \lambda$, as in $t_0$.

Let $g \in G$. We have $s_0.queue[g] = \lambda$. By the definition of $F_{dvs}$ we have $F_{dvs}(s_0).queue[g] = \lambda$, as in $t_0$.

Let $p \in P$, $g \in G$. We have $s_0.next[p,g] = 1$, $purgesize(s_0.queue[g]) = 0$ and $s_0.\text{msgs-from-vs}[g]_p = \lambda$. By the definition of $F_{dvs}$ we have $F_{dvs}(s_0).next[p,g] = 1$, as in $t_0$. A similar argument holds for next-safe.

Thus $F_{dvs}(s_0) = t_0$, as needed. □


Lemma 5.2.23 Let $s$ be a reachable state of DVS-IMPL, $F_{dvs}(s)$ a reachable state of DVS-SYS, and $(s, \pi, s')$ a step of DVS-IMPL. Then there is an execution fragment $\alpha$ of DVS-SYS that goes from $F_{dvs}(s)$ to $F_{dvs}(s')$, such that $trace(\alpha) = trace(\pi)$.

Proof: By case analysis based on the type of the action $\pi$. (The only interesting case is where $\pi = \text{DVS-NEWVIEW}(v)_p$.) Define $t = F_{dvs}(s)$ and $t' = F_{dvs}(s')$.


1. $\pi = \text{VS-CREATEVIEW}(v)$

Then $trace((s, \pi, s')) = \lambda$. Action $\pi$ modifies $created$. The definition of $F_{dvs}$ is not sensitive to this change. Therefore, $t = t'$, and we set $\alpha = t$.

2. $\pi = \text{VS-NEWVIEW}(v)_p$

Then $trace((s, \pi, s')) = \lambda$. Action $\pi$ modifies $cur_p$, $\text{info-sent}[cur.id]_p$, and $\text{current-viewid}[p]$, and adds an “info” message to $\text{msgs-to-vs}[cur.id]_p$. The definition of $F_{dvs}$ is not sensitive to any of these changes. Therefore, $t = t'$, and we set $\alpha = t$.


3. $\pi = \text{VS-GPSND}(m)_p$

Then $trace((s, \pi, s')) = \lambda$. Action $\pi$ just moves a message from the queue $\text{msgs-to-vs}[cur.id]_p$ to the queue $pending[p, \text{current-viewid}[p]]$. The definition of $F_{dvs}$ is not sensitive to this change. Therefore, $t = t'$, and we set $\alpha = t$.


4. $\pi = \text{VS-ORDER}(m, p, g)$

Then $trace((s, \pi, s')) = \lambda$. Action $\pi$ moves a message from $pending[p,g]$ to $queue[g]$. We consider two cases.

(a) $m \in M_c$.
Then we set $\alpha = (t, \text{DVS-ORDER}(m,p,g), t')$. We claim that $\text{DVS-ORDER}(m,p,g)$ is enabled in $t$: Since $\text{VS-ORDER}(m,p,g)$ is enabled in $s$, it follows that $m$ is the head of $s.pending[p,g]$. By the definition of $F_{dvs}$, $m$ is also the head of $t.pending[p,g]$. It follows that $\text{DVS-ORDER}(m,p,g)$ is enabled in $t$.
By definition of $F_{dvs}$, $t'$ differs from $t$ only in the fact that $m$ is moved from $pending[p,g]$ to $queue[g]$. This is the effect achieved by applying $\text{DVS-ORDER}(m,p,g)$ to $t$.

(b) $m \notin M_c$.
Then the definition of $F_{dvs}$ is not sensitive to this change. Therefore, $t = t'$, and we set $\alpha = t$.


5. $\pi = \text{VS-GPRCV}((\text{“info”}, v, V))_{q,p}$

Then $trace((s, \pi, s')) = \lambda$. This action can modify $\text{info-rcvd}[cur.id_p, q]_p$, $act_p$ and $amb_p$ (see code of DVS) and causes $next[p, cur.id_p]$ to be incremented (see code of VS). The definition of $F_{dvs}$ is not sensitive to these changes. (The only interesting case is the definition of $t.next[p, cur.id_p]$, where the absolute values of the first two terms on the right-hand side are both increased by 1, but they cancel each other out.) Therefore, $t = t'$, and we set $\alpha = t$.

6. $\pi = \text{VS-GPRCV}(\text{“registered”})_{q,p}$

Then $trace((s, \pi, s')) = \lambda$. This action can modify $\text{rcvd-rgst}[cur.id, q]_p$. It also causes the pointer $next[p, cur.id_p]$ to be incremented. The definition of $F_{dvs}$ is not sensitive to these changes. (The only interesting case is the definition of $t.next[p, cur.id_p]$, where the absolute values of the first two terms on the right-hand side are both increased by 1, but they cancel each other out.) Therefore, $t = t'$, and we set $\alpha = t$.


7. $\pi = \text{VS-GPRCV}(m)_{q,p}$, $m \in M_c$

Then $trace((s, \pi, s')) = \lambda$. This action copies a message from the sequence $queue[cur.id_p]$ to the sequence $\text{msgs-from-vs}[p, \text{client-cur}[p]]$, and causes $next[p, cur.id_p]$ to be incremented. The definition of $F_{dvs}$ is not sensitive to these changes. (The only interesting case is the definition of $t.next[p, cur.id_p]$, where the absolute values of the first and third terms on the right-hand side are both increased by 1, but they cancel each other out.) Therefore, $t = t'$, and we set $\alpha = t$.


8. $\pi = \text{VS-SAFE}((m, v, V))_{q,p}$, $m \in \{\text{“info”}, \text{“registered”}\}$

Then $trace((s, \pi, s')) = \lambda$. Action $\pi$ just causes $\text{next-safe}[p, cur.id_p]$ to be incremented. The definition of $F_{dvs}$ is not sensitive to this change. (The only interesting case is the definition of $t.\text{next-safe}[p, cur.id_p]$, where the absolute values of the first two terms on the right-hand side are both increased by 1, but they cancel each other out.) Therefore, $t = t'$, and we set $\alpha = t$.

9. $\pi = \text{VS-SAFE}(m)_{q,p}$, $m \in M_c$

Then $trace((s, \pi, s')) = \lambda$. Action $\pi$ adds a message to $\text{safe-from-vs}[cur.id]_p$ and causes the pointer $\text{next-safe}[p, cur.id_p]$ to be incremented. The definition of $F_{dvs}$ is not sensitive to these changes. (The only interesting case is the definition of $t.\text{next-safe}[p, cur.id_p]$, where the absolute values of the first and third terms on the right-hand side are both increased by 1, but they cancel each other out.) Therefore, $t = t'$, and we set $\alpha = t$.


10. $\pi = \text{DVS-NEWVIEW}(v)_p$

Then $trace((s, \pi, s')) = \pi$. In DVS-IMPL, this action modifies only the variables $amb_p$, $attempted_p$ and $\text{client-cur}_p$. We have $s'.\text{client-cur}_p = v$ and $s'.attempted_p = s.attempted_p \cup \{v\}$. By definition of $F_{dvs}$, we have that $t'.\text{current-viewid}[p] = s'.\text{client-cur}.id_p = v.id$, $t'.created = t.created \cup \{v\}$ and $t'.attempted[v.id] = t.attempted[v.id] \cup \{p\}$, while all other state variables in $t'$ are as in $t$.

We consider two cases:


(a) $v \in t.created$.
In this case, we set $\alpha = (t, \pi', t')$, where $\pi' = \text{DVS-NEWVIEW}(v)_p$. The code shows that $\pi'$ brings DVS-SYS from state $t$ to state $t'$. It remains to prove that $\pi'$ is enabled in state $t$, that is, that $v \in t.created$ and $v.id > t.\text{current-viewid}[p]$. The first of these two conditions is true because of the defining condition for this case. The second condition follows from the precondition of $\pi$ in DVS-IMPL: this precondition implies that $v.id > s.\text{client-cur}.id_p$, and by the definition of $F_{dvs}$ we have $t.\text{current-viewid}[p] = s.\text{client-cur}.id_p$.

(b) $v \notin t.created$.
In this case we set $\alpha = (t, \pi', t'', \pi'', t')$, where $\pi' = \text{DVS-CREATEVIEW}(v)$, $\pi'' = \text{DVS-NEWVIEW}(v)_p$, and $t''$ is the unique state that arises by running the effect of $\pi'$ from $t$. The code shows that $\alpha$ brings DVS-SYS from state $t$ to state $t'$. It remains to prove that $\pi'$ is enabled in $t$ and that $\pi''$ is enabled in $t''$.


The precondition of $\pi'$ requires that (i) $\forall w \in t.created$, $v.id \neq w.id$ and (ii) $\forall w \in t.created$, either $\exists x \in s.TotAtt$ satisfying $w.id < x.id < v.id$ or $v.id < x.id < w.id$, or else $v.set \cap w.set \neq \emptyset$.

To see requirement (i), suppose for the sake of contradiction that $w \in t.created$ and $w.id = v.id$. The precondition of $\pi$ in DVS-IMPL implies that $v = s.cur_p$, which implies that $v \in s.created$. Since $w \in t.created$, the definition of $F_{dvs}$ implies that $w \in s.attempted_q$ for some $q$. This implies that $w \in s.created$. But then Invariant 4.1.1 implies that $v = w$. But this contradicts the fact that $v \notin t.created$ and $w \in t.created$.

To see requirement (ii), suppose that $w \in t.created$ and there is no $x \in s.TotAtt$ satisfying $w.id < x.id < v.id$ or $v.id < x.id < w.id$. Since $w \in t.created$, by definition of $F_{dvs}$, $w \in s.attempted_q$ for some $q$. Clearly, $w \in s'.attempted_q$. Therefore, $w \in s'.Att$. By the code of $\pi$ we have that $v \in s'.attempted_p$. Therefore we also have $v \in s'.Att$. Moreover, there is no $x \in s'.TotAtt$ satisfying $w.id < x.id < v.id$ or $v.id < x.id < w.id$. Then Invariant 5.2.17 implies that $v.set \cap w.set \neq \emptyset$, as needed to prove that $\pi'$ is enabled in $t$.


We now prove that $\pi''$ is enabled in state $t''$. The precondition of $\pi''$ requires that $v \in t''.created$ and $v.id > t''.\text{current-viewid}[p]$. The first condition is true because $v$ is added to $created$ by $\pi'$. The second condition follows from the precondition of $\pi$ in DVS-IMPL: The precondition of $\pi$ implies that $v.id > s.\text{client-cur}.id_p$. The definition of $F_{dvs}$ implies that $t.\text{current-viewid}[p] = s.\text{client-cur}.id_p$. Moreover, $t''.\text{current-viewid}[p] = t.\text{current-viewid}[p]$. It follows that $v.id > t''.\text{current-viewid}[p]$. Thus $\pi''$ is enabled in state $t''$.


11. $\pi = \text{DVS-REGISTER}_p$

Then $trace((s, \pi, s')) = \pi$. Let $g$ be $s.\text{client-cur}.id_p$, which equals $t.\text{current-viewid}[p]$ by the abstraction function. If $g = \bot$, then $\pi$ has no effect in DVS-IMPL, so $s = s'$; thus $t = t'$, as required to show that $\pi$ brings DVS from $t$ to $t'$. Otherwise, $g \neq \bot$, so by the code in DVS-IMPL, this action sets $reg[g]_p$ to true and inserts a “registered” message into $\text{msgs-to-vs}[g]_p$. By definition of $F_{dvs}$, $t'$ is the same as $t$ except that $t'.registered[g] = t.registered[g] \cup \{p\}$. We set $\alpha = (t, \text{DVS-REGISTER}_p, t')$. It is easy to check that $\text{DVS-REGISTER}_p$ brings DVS-SYS from $t$ to $t'$.


12. $\pi = \text{DVS-GARBAGECOLLECT}(v)_p$

Then $trace((s, \pi, s')) = \lambda$. This action can modify $act_p$ and $amb_p$. The definition of $F_{dvs}$ is not sensitive to these changes. Therefore, $t = t'$, and we set $\alpha = t$.

13. $\pi = \text{DVS-GPSND}(m)_p$

Then $trace((s, \pi, s')) = \pi$. We set $\alpha = (t, \text{DVS-GPSND}(m)_p, t')$. We consider two cases:

(a) $s.\text{client-cur}.id_p = \bot$.
Then $s = s'$. In this case, the definition of $F_{dvs}$ implies that also $t.\text{current-viewid}[p] = \bot$, which implies that the action also has no effect in $t$, which suffices.

(b) $s.\text{client-cur}.id_p \neq \bot$.
In this case, the action appends $m$ to $\text{msgs-to-vs}[g]_p$, where $g = \text{client-cur}.id_p$. Hence we have that $s'.\text{msgs-to-vs}[g]_p = s.\text{msgs-to-vs}[g]_p \circ m$. By the definition of $F_{dvs}$, we get that $t'.pending[p,g] = t.pending[p,g] \circ m$. This is the effect of the action in $t$ (using the fact that $t.\text{current-viewid}[p] \neq \bot$).


14. $\pi = \text{DVS-GPRCV}(m)_p$

Then $trace((s, \pi, s')) = \pi$. This action removes the head of $\text{msgs-from-vs}[g]_p$, where $g = cur.id_p$. We have that $s.\text{msgs-from-vs}[g]_p = m \circ s'.\text{msgs-from-vs}[g]_p$. Thus $t'.next[p,g] = t.next[p,g] + 1$. We set $\alpha = (t, \text{DVS-GPRCV}(m)_p, t')$. It is easy to check that the step has the required effect in DVS-SYS. The fact that $\text{DVS-GPRCV}(m)_p$ is enabled in $t$ follows from Invariant 5.2.19.

15. $\pi = \text{DVS-SAFE}(m)_p$

Then $trace(\pi) = \pi$. This action removes the head of $\text{safe-from-vs}[g]_p$, where $g = cur.id_p$. We have that $s.\text{safe-from-vs}[g]_p = m \circ s'.\text{safe-from-vs}[g]_p$. Thus $t'.\text{next-safe}[p,g] = t.\text{next-safe}[p,g] + 1$. We set $\alpha = (t, \text{DVS-SAFE}(m)_p, t')$. It is easy to check that the step has the required effect in DVS-SYS. The fact that $\text{DVS-SAFE}(m)_p$ is enabled in $t$ follows from Invariant 5.2.21.


Lemmas 5.2.22 and 5.2.23 prove that $F_{dvs}$ is an abstraction function from DVS-IMPL to DVS and thus the following theorem holds.

Theorem 5.2.24 Every trace of DVS-IMPL is a trace of DVS-SYS.


5.3 An application of Dvs 


In this section we show how to use DVS to implement a totally ordered broadcast service, called TO. In Section 5.3.1 we give the specification of the totally ordered broadcast service TO, in Section 5.3.2 we describe the implementation, which we call TO-IMPL, and in Section 5.3.3 we prove that TO-IMPL implements TO.

5.3.1 The TO service


The TO service was originally defined in [41]. This service accepts messages from clients and delivers 
them to all clients according to the same total order. This kind of service is a building block for 
many fault-tolerant distributed applications. The specification is reproduced in Figure 5-5. 

The following is an informal description of the service. Processes can broadcast messages by means of actions $\text{BCAST}(a)_p$. Such a message $a$ is appended to a queue local to process $p$, $pending[p]$. The service establishes a total order on the messages by means of action $\text{TO-ORDER}(a,p)$, which takes a message from the local queue of a process and puts it into a global queue. The order established by this global queue is the one used to deliver messages. The pointer $next[q]$ points to the next message in the global queue to be delivered to process $q$ by means of action $\text{BRCV}(a)_{p,q}$.


TO

Signature:
  Input:    BCAST(a)_p, a ∈ A, p ∈ P
  Internal: TO-ORDER(a, p), a ∈ A, p ∈ P
  Output:   BRCV(a)_{p,q}, a ∈ A, p, q ∈ P

State:
  queue ∈ seqof(A × P), init λ
  for each p ∈ P:
    pending[p] ∈ seqof(A), init λ
    next[p] ∈ N^{>0}, init 1

Transitions:
  input BCAST(a)_p
    Eff: append a to pending[p]

  internal TO-ORDER(a, p)
    Pre: a is head of pending[p]
    Eff: remove head of pending[p]
         append (a, p) to queue

  output BRCV(a)_{p,q}
    Pre: queue(next[q]) = (a, p)
    Eff: next[q] := next[q] + 1

Figure 5-5: The TO service.
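As an informal aid to reading Figure 5-5, the TO automaton can be transcribed into a small sequential Python class. This is only a sketch: the class name TOSpec, the method names, and the use of exceptions to signal that an action is not enabled are assumptions of the illustration and are not part of the specification.

```python
from collections import defaultdict, deque


class TOSpec:
    """A sequential reading of the TO automaton of Figure 5-5.

    Method names and the use of exceptions for disabled actions are
    assumptions of this sketch, not part of the specification.
    """

    def __init__(self):
        self.queue = []                       # global sequence of (a, p) pairs
        self.pending = defaultdict(deque)     # pending[p]: broadcast by p, not yet ordered
        self.next = defaultdict(lambda: 1)    # next[q]: 1-based index of q's next delivery

    def bcast(self, a, p):
        """Input BCAST(a)_p: append a to pending[p]."""
        self.pending[p].append(a)

    def to_order(self, p):
        """Internal TO-ORDER(a, p), where a is the head of pending[p]."""
        if not self.pending[p]:
            raise RuntimeError("TO-ORDER is not enabled: pending[p] is empty")
        a = self.pending[p].popleft()
        self.queue.append((a, p))

    def brcv(self, q):
        """Output BRCV(a)_{p,q}: deliver queue(next[q]) to q and advance next[q]."""
        i = self.next[q]
        if i > len(self.queue):
            raise RuntimeError("BRCV is not enabled: nothing left to deliver to q")
        a, p = self.queue[i - 1]
        self.next[q] = i + 1
        return a, p


to = TOSpec()
to.bcast("x", "p1")
to.bcast("y", "p2")
to.to_order("p2")   # the service happens to order p2's message first
to.to_order("p1")
seen_by_p1 = [to.brcv("p1"), to.brcv("p1")]
seen_by_p2 = [to.brcv("p2"), to.brcv("p2")]
```

Both clients observe the messages in the order fixed by the global queue, regardless of the order in which they were broadcast.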


5.3.2 The implementation of TO 


We provide an implementation of TO using DVS as a building block. The implementation, which we call TO-IMPL, consists of an automaton DVS-TO-TO_p for each p ∈ P, and the DVS specification.

The implementation is similar to the TO implementation provided in [41]. Both algorithms rely on 
primary views to establish a total order of client messages. The main difference is that the algorithm 
in [41] uses a static notion of primary and the new one uses a dynamic notion. The algorithm of 
[41] is built upon a VS service that reports non-primary as well as primary views, and uses a simple 
local test to determine if the view is primary. That algorithm does some non-critical background 
work (gossiping information) in non-primary views. In contrast, the algorithm proposed here is built 
upon the DVS service, which only reports primary views. Thus the new algorithm is simpler in that 
it does not perform the local tests and does not carry out any processing in non-primary views. On 
the other hand, in the new algorithm, the application programs must perform DVS-REGISTER actions to tell the DVS service when they have “established” new views. Although the new algorithm appears very similar to the one of [41], the fact that the DVS service provides weaker and more complicated guarantees than the VS service makes the new algorithm harder to prove correct.


The TO-IMPL algorithm involves normal and recovery activity. Normal activity occurs while a 
group view is not changing. Recovery activity begins when a new primary view is presented by 
Dvs, and continues while the members combine information from their previous history, to provide 
a consistent basis for ongoing normal activity. 

During normal activity, each client message received by TO-IMPL is given a system-wide unique 
label, which consists of a view identifier (the one of the view in which the message is received), a 
sequence number and the identifier of the process receiving the message. The association between 
client messages and their unique labels is recorded in a relation content and communicated to other 
processes in the same view using Dvs. When a message is received, the label is given an order, a 
tentative position in the system-wide total order the service is to provide. When client messages 
have been reported as delivered to all the members of the view, by the “safe” notification of Dvs, 
the label and its order may become confirmed. The messages associated with confirmed labels may 
be released to the clients in the given order. 

The consistent sequence of message delivery within each view keeps this tentative order consistent 
at members of a given view, but it may be not consistent between nodes in different views. To avoid 
inconsistencies processes need state exchange at the beginning of a new view. 

When a new primary view is reported by DVS, recovery activity occurs to integrate the knowledge of different members. First, each member of a new view sends a message, using DVS, that contains a summary of that node’s state. The summary of a node’s state contains the following information: the association of labels with client messages, stored in content; the order of client messages to be reported to the clients, stored in order; a pointer to the next client message to be confirmed, stored in nextconfirm; and the view identifier of the primary view with the highest view identifier in which the order sequence has been modified, stored in highprimary.

Once a node has received all members’ state summaries, it processes the information in one atomic step, i.e., it establishes the new view. Once a node establishes a view, it informs DVS of that fact with a DVS-REGISTER action. The node processes state information as follows: it defines its confirmed labels to be the longest prefix of confirmed labels known in any of the summaries; it determines the representatives as the members whose summaries include the greatest highprimary value; and it adopts as its new order the order of a “chosen” representative (the chosen representative is arbitrary but must be the same for all processes) extended with all other labels appearing in any of the received summaries, arranged in label order.

Then recovery continues by collecting the DVS safe indications. Once the state exchange is safe, all labels used in the exchange are marked as safe and all associated messages are confirmed just as in normal processing.

DVS-TO-TO_p

Signature:
  Input:    BCAST(a)_p, a ∈ A
            DVS-GPRCV(m)_{q,p}, q ∈ P, m ∈ C ∪ S
            DVS-SAFE(m)_{q,p}, q ∈ P, m ∈ C ∪ S
            DVS-NEWVIEW(v)_p, v ∈ V
  Output:   DVS-REGISTER_p
            DVS-GPSND(m)_p, m ∈ C ∪ S
            BRCV(a)_{q,p}, a ∈ A, q ∈ P
  Internal: CONFIRM_p

State:
  current ∈ V_⊥, init v_0 if p ∈ P_0, ⊥ else
  status ∈ {normal, send, collect}, init normal
  content ∈ 2^C, init ∅
  nextseqno ∈ N^{>0}, init 1
  buffer ∈ seqof(L), init λ
  safe-labels ∈ 2^L, init ∅
  order ∈ seqof(L), init λ
  nextconfirm ∈ N^{>0}, init 1
  nextreport ∈ N^{>0}, init 1
  highprimary ∈ G, init g_0 if p ∈ P_0, ⊥ else
  gotstate, a partial function from P to S, init ∅
  safe-exch ⊆ P, init ∅
  registered ⊆ G, init {g_0} if p ∈ P_0, ∅ else
  delay ∈ seqof(A), init λ
  for each g ∈ G,
    established[g], a bool, init true if g = g_0, p ∈ P_0, false else

Figure 5-6: The DVS-TO-TO_p code.


For the code, shown in Figures 5-6 and 5-7, we need the following definitions. $L = G \times N^{>0} \times P$ is the set of labels, with selectors $l.id$, $l.seqno$ and $l.origin$. $A$ is the set of messages that can be sent by the clients of the TO service. $C = L \times A$ is the set of possible associations between labels and client messages. $S = 2^C \times seqof(L) \times N^{>0} \times G$ is the set of summaries, with selectors $x.con$, $x.ord$, $x.next$ and $x.high$. Given $x \in S$, $x.confirm$ is the prefix of $x.ord$ such that $|x.confirm| = \min(x.next - 1, |x.ord|)$. If $Y$ is a partial function from processor ids to summaries, then we define:

$knowncontent(Y) = \bigcup_{q \in dom(Y)} Y(q).con$,
$maxprimary(Y) = \max_{q \in dom(Y)} \{Y(q).high\}$,
$maxnextconfirm(Y) = \max_{q \in dom(Y)} \{Y(q).next\}$,
$reps(Y) = \{q \in dom(Y) : Y(q).high = maxprimary(Y)\}$,
$chosenrep(Y)$ = some element in $reps(Y)$,
$shortorder(Y) = Y(chosenrep(Y)).ord$, and
$fullorder(Y)$ = $shortorder(Y)$ followed by the remaining elements of $dom(knowncontent(Y))$, in label order.
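The derived functions just defined can be made concrete with a short sketch. Several choices below are assumptions made only for illustration: view identifiers and process identifiers are integers, label order is the lexicographic order on (id, seqno, origin) triples, and chosenrep picks the smallest representative (any rule that every process applies identically would do).

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple

Label = Tuple[int, int, int]  # (id, seqno, origin); integer ids are an assumption


@dataclass(frozen=True)
class Summary:
    con: FrozenSet[Tuple[Label, str]]  # associations between labels and client messages
    ord: Tuple[Label, ...]             # tentative order of labels
    next: int                          # pointer to the next label to be confirmed
    high: int                          # highest primary view id in which ord was modified


def knowncontent(Y: Dict[int, Summary]) -> FrozenSet[Tuple[Label, str]]:
    return frozenset().union(*(x.con for x in Y.values()))


def maxprimary(Y: Dict[int, Summary]) -> int:
    return max(x.high for x in Y.values())


def maxnextconfirm(Y: Dict[int, Summary]) -> int:
    return max(x.next for x in Y.values())


def reps(Y: Dict[int, Summary]) -> set:
    return {q for q, x in Y.items() if x.high == maxprimary(Y)}


def chosenrep(Y: Dict[int, Summary]) -> int:
    # Assumption: the smallest id is chosen, so that all processes agree.
    return min(reps(Y))


def shortorder(Y: Dict[int, Summary]) -> List[Label]:
    return list(Y[chosenrep(Y)].ord)


def fullorder(Y: Dict[int, Summary]) -> List[Label]:
    short = shortorder(Y)
    rest = sorted({l for (l, _a) in knowncontent(Y)} - set(short))
    return short + rest


def confirm(x: Summary) -> List[Label]:
    return list(x.ord[: min(x.next - 1, len(x.ord))])


Y = {
    1: Summary(frozenset({((1, 1, 1), "a"), ((1, 1, 2), "b")}),
               ((1, 1, 1), (1, 1, 2)), next=2, high=1),
    2: Summary(frozenset({((1, 1, 1), "a"), ((2, 1, 2), "c")}),
               ((1, 1, 1),), next=1, high=0),
}
```

Here process 1 is the only representative, so its order is adopted and the label (2, 1, 2), known only to process 2, is appended at the end.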

The system TO-IMPL is the composition of all the DVS-TO-TO_p automata and DVS with all the external actions of DVS hidden.

The allstate, allcontent and allconfirm derived variables are defined for TO-IMPL as follows (this 
is as in [41]). 

We write $allstate[p,g]$ to denote a set of summaries, defined so that $x \in allstate[p,g]$ if and only if at least one of the following holds:

1. $current.id_p = g$ and $x = (content_p, order_p, nextconfirm_p, highprimary_p)$.
2. $x \in pending[p,g]$.
3. $(x, p) \in queue[g]$.
4. For some $q$, $current.id_q = g$ and $x = gotstate(p)_q$.

DVS-TO-TO_p (cont’d)

Transitions:

  input BCAST(a)_p
    Eff: append a to delay

  internal LABEL(a)_p
    Pre: a is head of delay
         current ≠ ⊥
    Eff: let l be (current.id, nextseqno, p)
         content := content ∪ {(l, a)}
         append l to buffer
         nextseqno := nextseqno + 1
         delete head of delay

  output DVS-GPSND((l, a))_p
    Pre: status = normal
         l is head of buffer
         (l, a) ∈ content
    Eff: delete head of buffer

  input DVS-GPRCV((l, a))_{q,p}
    Eff: content := content ∪ {(l, a)}
         order := order ∘ l

  input DVS-SAFE((l, a))_{q,p}
    Eff: safe-labels := safe-labels ∪ {l}

  internal CONFIRM_p
    Pre: order(nextconfirm) ∈ safe-labels
    Eff: nextconfirm := nextconfirm + 1

  output BRCV(a)_{q,p}
    Pre: nextreport < nextconfirm
         (order(nextreport), a) ∈ content
         q = order(nextreport).origin
    Eff: nextreport := nextreport + 1

  input DVS-NEWVIEW(v)_p
    Eff: current := v
         nextseqno := 1
         buffer := λ
         gotstate := ∅
         safe-exch := ∅
         safe-labels := ∅
         status := send

  output DVS-GPSND(x)_p
    Pre: status = send
         x = (content, order, nextconfirm, highprimary)
    Eff: status := collect

  input DVS-GPRCV(x)_{q,p}
    Eff: content := content ∪ x.con
         gotstate := gotstate ⊕ (q, x)
         if (dom(gotstate) = current.set) ∧ (status = collect) then
           nextconfirm := maxnextconfirm(gotstate)
           order := fullorder(gotstate)
           highprimary := current.id
           status := normal
           established[current.id] := true

  output DVS-REGISTER_p
    Pre: current ≠ ⊥
         established[current.id]
         current.id ∉ registered
    Eff: registered := registered ∪ {current.id}

  input DVS-SAFE(x)_{q,p}
    Eff: safe-exch := safe-exch ∪ {q}
         if safe-exch = current.set then
           safe-labels := safe-labels ∪ range(fullorder(gotstate))

Figure 5-7: The DVS-TO-TO_p code (cont’d).


Thus, $allstate[p,g]$ consists of all the summary information that is in the state of $p$ if $p$’s current view is $g$, plus all the summary information that has been sent out by $p$ in state exchange messages in view $g$ and is now remembered elsewhere among the state components of TO-IMPL. Notice that $allstate[p,g]$ consists only of summaries: an ordinary message $(l, a)$ is never an element of $allstate[p,g]$. We write $allstate[g]$ to denote $\bigcup_{p \in P} allstate[p,g]$, and $allstate$ to denote $\bigcup_{g \in G} allstate[g]$.

We write $allcontent$ for $\bigcup_{x \in allstate} x.con \cup \{(l,a) : \exists g, p : ((l,a),p) \in range(queue[g]) \lor (l,a) \in range(pending[p,g])\}$. This represents all the information available anywhere that links a label with a corresponding data value.

We write $allconfirm$ for $lub_{x \in allstate}(x.confirm)$.

For every $p \in P$, $g \in G$, $buildorder[p,g]$ is defined to be a sequence of labels, initially empty; this variable is maintained by following every statement of processor $p$ that assigns to $order$ with another statement $buildorder[p, current.id_p] := order$. It follows that if $p$ establishes a view with id $g$, and later leaves view $g$ for a view with a higher view identifier, then forever afterwards, $buildorder[p,g]$ remembers the value of $order_p$ at the point where $p$ left view $g$.


5.3.3 Proof that TO-IMPL implements TO 


The correctness proof for TO-IMPL follows the outline of the one in [41]. The main difference is that 
the main invariant, which corresponds to Lemma 6.18 of [41], requires a different, more subtle proof. 


We first provide some auxiliary invariants. 


Invariant 5.3.1 (TO-IMPL)
In any reachable state, if $p \in DVS.registered[g]$ then $established[g]_p$.

Proof: By induction on the length of the execution. The base case consists of proving that the invariant is true in the initial state. In the initial state, $DVS.registered[g]$ is empty for all $g$ except for $g = g_0$, for which $DVS.registered[g_0] = P_0$. So assume that $g = g_0$ and $p \in P_0$. In the initial state $established[g]_p = true$ if $g = g_0$ and $p \in P_0$. Hence the invariant is true.

For the inductive step, assume that the invariant is true in a reachable state $s$. We need to prove that it is true in $s'$ for any possible step $(s, \pi, s')$. Fix $g$ and $p$. The hypothesis changes from false to true only if $\pi = \text{DVS-REGISTER}_p$ and $s.current.id_p = g$, and the action’s precondition (in DVS-TO-TO_p) shows the conclusion. The conclusion never changes from true to false. □


Invariant 5.3.2 says that any view that is known (anywhere in the system state) to be an established primary was in fact attempted by all its members.

Invariant 5.3.2 (TO-IMPL)
In any reachable state, if $x \in allstate$ then there exists $w \in created$ such that $x.high = w.id$, and for all $p \in w.set$, $p \in attempted[w.id]$.

Proof: By induction on the length of the execution. The base case consists of proving that the invariant is true in the initial state. In the initial state, the invariant follows from the definition of $allstate$ (set $w = current.id_p$).

For the inductive step, assume that the invariant is true in a reachable state $s$. We need to prove that it is true in $s'$ for any possible step $(s, \pi, s')$. The only step that we have to worry about is when a new summary is created. When a new summary $x$ is created, $x.high$ is set to the identifier of the current view, and a message has been received from everyone in the membership. □


Invariant 5.3.3 says that if a view w is established, then no earlier view v can still be active. 


Invariant 5.3.3 (TO-IMPL)
In any reachable state, if $v \in created$, $x \in allstate$ and $x.high > v.id$ then there exists $p \in v.set$ with $current.id_p > v.id$.

Proof: Fix $v$, $x$ as given. Invariant 5.3.2 shows the existence of $w \in created$ such that $x.high = w.id$, and for all $p \in w.set$, $p \in attempted[w.id]$. Then Invariant 5.1.4 implies that there exists $p \in v.set$ with $\text{current-viewid}[p] > v.id$. But $\text{current-viewid}[p] = current.id_p$, which yields the result. □


Finally we provide the proof for the invariant corresponding to the invariant stated in Lemma 
6.18 of [41]. This invariant has a more subtle proof than the one given in Lemma 6.18 of [41]. That 
proof uses the strong intersection property among primary view membership (in the implementation 
of [41] each primary view intersects each other primary view). The proof in [41] does not work in the 
setting of DVS because Dvs guarantees a weaker intersection property (each primary view intersects 
only the primary views in between the preceding and the following totally registered primary views). 
The new proof also uses the fact about Dvs that once a view is attempted at all processes in its 


membership set, no views with lower identifiers can become established. 


Invariant 5.3.4 (TO-IMPL)
In any reachable state, suppose that $v \in created$, $\sigma \in seqof(L)$, and for every $p \in v.set$, the following is true: If $current.id_p > v.id$ then $established[v.id]_p$ and $\sigma \leq buildorder[p, v.id]$.

Then for every $x \in allstate$ with $x.high > v.id$, we have that $\sigma \leq x.ord$.


Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. In the initial state, the only created view is v_0, and there is no
x ∈ allstate with x.high > g_0. So the statement is vacuously true.

For the inductive step, assume that the invariant is true in a reachable state s. We need to prove
that it is true in s' for any possible step (s, π, s'). So fix v ∈ s'.created and σ, and assume that for
every p ∈ v.set, if s'.current.id_p > v.id then s'.established[v.id]_p and σ ≤ s'.buildorder[p, v.id].

If v ∉ s.created, then π must be CREATEVIEW(v). Then s'.established[v.id]_p = false for all p. Fix
x ∈ s'.allstate and suppose that x.high > v.id. Then Invariant 5.3.3 applied to s' implies that there
exists p ∈ v.set with s'.current.id_p > v.id; fix such a p. Then the hypothesis part of the invariant
for s' implies that s'.established[v.id]_p = true, a contradiction. Hence no such x exists and the
conclusion holds vacuously in this case, so from now on we may assume that v ∈ s.created.

As usual, the interesting steps are those that convert the hypothesis from false to true, and those
that keep the hypothesis true while converting the conclusion from true to false.

In this case, there are no steps that convert the hypothesis from false to true: If there is some
p ∈ v.set for which s.current.id_p > v.id and either s.established[v.id]_p = false or σ is not a
prefix of s.buildorder[p, v.id], then also s'.current.id_p > v.id (the id never decreases) and either
s'.established[v.id]_p = false or σ is not a prefix of s'.buildorder[p, v.id]. (These two cases carry over,
since s.current.id_p > v.id implies that established[v.id]_p and buildorder[p, v.id] cannot change.)

So it remains to consider any steps that keep the hypothesis true while converting the conclusion
from true to false. Thus, we assume that if s.current.id_p > v.id then s.established[v.id]_p and
σ ≤ s.buildorder[p, v.id]. Suppose that x ∈ s'.allstate and x.high > v.id. If also x ∈ s.allstate then
we can apply the inductive hypothesis, which implies that σ ≤ x.ord, as needed. So the only concern
is with steps that produce a new summary.

Any step that produces the new summary x by modifying an old summary x' ∈ s.allstate, in such
a way that x'.ord ≤ x.ord and x'.high = x.high, is easy to handle: For such a step, x'.high > v.id
and so the inductive hypothesis implies that σ ≤ x'.ord ≤ x.ord, as needed. So the only concern is
with GPRCV steps that produce a new summary x by delivering the last state-exchange message of
a view w to some processor p. Thus x.high = w.id. Let x' be the summary of q' = chosenrep in
s'.gotstate. We claim x'.high ≥ v.id.

To prove the claim, we let v' denote the unique element with highest viewid among the elements
of s'.created such that v'.id < w.id and s'.registered[v'.id] = v'.set. Let v'' denote either v' or v,
whichever has the higher viewid. Invariant 5.1.3 shows that w.set ∩ v''.set ≠ ∅, no matter whether
v'' is v or v'. Fix any element q'' in w.set ∩ v''.set.

Recall that the condition for establishing a view shows that domain(s'.gotstate_p) = w.set, so by
the code, either q'' ∈ domain(s.gotstate_p) or else q'' is the sender of the message whose receipt is
the step we are examining. In the former case, let x'' be the summary s.gotstate(q'')_p; in the latter
let x'' be the summary whose receipt is the event. In either case we have x'' ∈ s.allstate[q'', w.id].


We now show that s.established[v''.id]_q''. We consider two cases:

1. v'' = v'. Then q'' ∈ v'.set so by definition of v', we obtain q'' ∈ s.registered[v'.id]; therefore, we have
that s.established[v'.id]_q''.

2. v'' = v. Because s.allstate[q'', w.id] is non-empty, the analogue of Part 4 of Lemma 6.7 from
[41] implies that s.current.id_q'' ≥ w.id. We have that x.high > v.id by assumption, and
x.high = w.id by the code; therefore, w.id > v.id. So also s.current.id_q'' > v.id. Recall that
we are in the case where the hypothesis of this lemma is true. Therefore, by this hypothesis
(which uses q'' ∈ v.set), we obtain that s.established[v.id]_q''.

By the analogue of Lemma 6.14 from [41] (applied with q'' replacing p) we obtain x''.high ≥ v''.id.
By the definition of q' as a member that maximizes the high component in the summary recorded
in s'.gotstate, we have x'.high ≥ x''.high. Therefore x'.high ≥ v''.id ≥ v.id, completing our proof of
the claim.

If x'.high > v.id then we can apply the induction hypothesis to x' and we are done, since x'.ord ≤
x.ord. So suppose x'.high = v.id. Note that x' ∈ s.allstate[q', w.id]. By an analogue of Lemma
6.16 from [41] there must exist¹ q ∈ v.set such that s.established[v.id]_q, x'.ord = s.buildorder[q, v.id],
and (either x'.high = w.id or s.current.id_q > v.id). Since x'.high = v.id < x.high = w.id, the
last property can be simplified to s.current.id_q > v.id. By monotonicity of current.id_q, we have
s'.current.id_q > v.id. The hypothesis of this lemma says that this forces σ ≤ s'.buildorder[q, v.id].
Since x'.ord ≤ x.ord by the code for this event, and x'.ord = s.buildorder[q, v.id] as shown above,
and s.buildorder[q, v.id] = s'.buildorder[q, v.id] since q is not currently in view v, we get σ ≤ x.ord,
which is what we need. □

¹Direct application of the lemma actually shows the existence of some ṽ and q ∈ ṽ.set, but since x'.high = ṽ.id
and also x'.high = v.id, uniqueness of viewids shows we may take ṽ to be v itself.


Let s be a state of TO-IMPL. The state t = F_to(s) of TO is the following.

1. t.queue = applyall((s.allcontent, origin), s.allconfirm),
   where the selector origin is regarded as a function from labels to processors.

2. t.next[p] = s.next-report_p.

3. t.pending[p] = applyall(s.allcontent, b) + s.delay_p, where b is the sequence of labels such that

   (a) range(b) is the set of labels l such that l.origin = p, (l, a) ∈ s.allcontent for some a,
       and l ∉ range(s.allconfirm).

   (b) b is ordered according to the label order.

Figure 5-8: The abstraction function F_to.


To complete the implementation proof, we define a function from the reachable states of TO-IMPL
to the states of TO and prove that it is an abstraction function. This function is defined exactly as
in [41]. Figure 5-8 shows the abstraction function F_to. The proof that F_to is an abstraction function
is as in [41]. Since F_to is an abstraction function we have the following theorem.


Theorem 5.3.5 Every trace of TO-IMPL is a trace of TO.


5.4 Remarks 


The safe indications provided by the DVS service are crucial to the application: commitments about
the total order are made only when receiving safe notifications for particular messages. This is a
common point for any application that needs to preserve strong data coherence. For such
applications, no commitments about the shared data can be made before safe indications are delivered.
However, the application can still perform some useful work while waiting for a safe indication. For
example, it can pre-compute the value of the shared data so that when the safe indications arrive
little processing will be needed (of course such computation is wasted if the safe indication never
arrives), or it can do optimistic updates to the shared data assuming that the safe indications will
arrive (in this case a rollback is required if the safe indications do not arrive).

The total order service that we have developed in this chapter can also be built using a sequence
of executions of a consensus algorithm (e.g., the MULTIPAXOS algorithm of [61]²). The advantage
of the approach taken here is that we use a building block, the DVS group communication service,
which may also be used for other applications. For example, in the next chapter, we will use a
variant of DVS to build two applications on top of it, an atomic multi-reader multi-writer shared
register and a dynamic version of the PAXOS algorithm.

This work deals entirely with safety properties; it remains to consider performance and
fault-tolerance properties as well. Future work also includes investigation of other applications of our DVS
specification, such as replicated data applications and load-balancing applications.

Another interesting exploration direction considers variations on the DVS specification, for
example, one in which the state exchange at the beginning of a new view is supported by the dynamic
view service. Also, one could provide variations on our specifications that are more specifically
tuned to systems like Isis and Ensemble. For example, a variation could require that processes that
move together from one view to the next receive exactly the same messages in the first view. This
guarantee is offered by Isis and Ensemble.

In the next chapter we provide a generalization of the DVS service to configurations. This
generalization will also include support for state transfer at the beginning of a new configuration.

²The name MULTIPAXOS is actually used in [29]. The original paper by Lamport [61] uses a different name
(multi-decree parliament protocol).


Chapter 6


The DC service 


In this chapter we generalize the notion of dynamic primary view to that of dynamic primary
configuration. We present DC, a specification for a dynamic primary configuration group communication
service. Section 6.2 provides the DC specification, Section 6.4 provides an implementation of DC,
and finally Section 6.5 describes an application that uses DC as a building block.


6.1 Overview 


The DC specification is similar to the DVS specification; the difference is that it provides the clients
with configurations instead of views. Like DVS, the DC specification is dynamic and provides primary
configurations. The main difficulty here is that a notion of dynamic primary configuration needs to
be developed (the notion of dynamic primary view has been studied in several papers, e.g. [55, 89]).
In this chapter we develop such a notion and we define the DC service, which provides dynamic
primary configurations to its clients.

Primary configurations must satisfy certain intersection properties with previous primary
configurations. The type of configurations that we consider in this chapter is the read-write-quorum
configurations (see Section 3.1.2). The intersection property that we require is that the
membership set of a new primary configuration must include the members of at least one read quorum
and one write quorum of the previous primary configuration. The DC specification provides to the
client only configurations satisfying this property.

Change of configurations might be driven either by change in the underlying physical distributed 
system or by the applications running on top of the system (e.g., a new quorum system could be 
installed on the same membership set). 

An important feature of the DC specification is that it incorporates a state-exchange at the
beginning of a new primary configuration. State-exchange at the beginning of a new configuration is
required by most applications. When a new configuration is issued each member of the configuration
is supposed to submit its current state to the service. Once having obtained the state from all
the members of the configuration, the DC service computes the most up to date state over all
the members, called the starting state. The starting state is then delivered to each member of
the configuration. This way, each member begins regular computation in the new configuration
knowing the starting state. We remark that this is different from the approach used by the DVS
service which lets the members of the configuration compute the starting state. Some existing
group communication services also integrate state-exchange within the service, e.g., [82, 19, 86],
some others do not, e.g., [33, 73, 38, 41]. The Transis system [33] can be augmented with a layer
providing state-exchange [5].

The DC specification offers a broadcast/convergecast communication mechanism. This
mechanism involves all the members of a quorum, and uses a condenser function to process the information
gathered from the quorum [66]. More specifically, a client that wants to send a message to the
members of its current configuration submits the message together with a condenser function to the
service; then the DC service broadcasts the message to all the members of the configuration and
waits for a response from a quorum (the type of the quorum, read or write, is also specified by the
client); once answers are received from a quorum, the DC service applies the condenser function to
these answers in order to compute a response to give back to the client that sent the message. Such
a series of actions should be seen as performing an operation requested by the client; executing the
operation requires the participation of a quorum of the processes.
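
As an illustration only, the broadcast/convergecast step can be sketched in Python as follows. The names `respond` and `latest`, the use of `None` for a missing acknowledgment, and the (tag, value) shape of the acknowledgments are assumptions made for this sketch; they are not part of the DC specification. The specification chooses a quorum nondeterministically, whereas the sketch simply takes the first complete quorum in the list.

```python
from typing import Any, Callable, Dict, FrozenSet, List, Optional


def respond(acks: Dict[str, Optional[Any]],
            quorums: List[FrozenSet[str]],
            condenser: Callable[[Dict[str, Any]], Any]) -> Optional[Any]:
    """Return the condensed response once some quorum has fully answered.

    acks maps each process to its acknowledgment value, or None if it has
    not answered yet. Returns None while no quorum is complete.
    """
    for quorum in quorums:
        if all(acks.get(q) is not None for q in quorum):
            # Apply the condenser only to the answers of the chosen quorum.
            return condenser({q: acks[q] for q in quorum})
    return None


def latest(acks: Dict[str, Any]) -> Any:
    """Hypothetical condenser: acks are (tag, value) pairs, keep the value
    carried by the highest tag."""
    return max(acks.values())[1]


read_quorums = [frozenset({"p1", "p2"}), frozenset({"p2", "p3"})]
acks = {"p1": (3, "x"), "p2": None, "p3": (5, "y")}
first = respond(acks, read_quorums, latest)    # no quorum has fully answered
acks["p2"] = (4, "z")
second = respond(acks, read_quorums, latest)   # quorum {p1, p2} is complete
```

A register-like client would typically use a condenser of this kind for reads, and a condenser that ignores the values and returns a fixed acknowledgment for writes.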

We remark that this kind of communication is different from those of the VS service [41] and
the DVS service. Instead, it is like the one used in [66]. We integrate it into DC because we want
to develop a particular application that benefits from this particular communication service (a
read/write register as is done in [66]).


6.2 The DC specification


Prior to providing the code for the DC specification, we need some notation and definitions, which
we introduce in the following.

Let OID be a set of operation identifiers, partitioned into sets OID_p, p ∈ P. We denote by
M_c ⊆ M the set of messages that clients may use for communication.

Let A be a set of “acknowledgment” values and let R be a set of “response” values. A condenser
function is a function from (P → A) to R. Let Φ be the set of all condenser functions. Let S
be the set of all possible states of the clients (a state of S does not need to be the entire client's
state, but it may contain only the relevant information in order for the application to work). The
DC specification uses a condenser function also to compute the starting state of a new configuration;
hence we assume that S ⊆ A and also S ⊆ R. Given a function f : P → D from the set of processes
P to some domain D and given a subset P' ⊆ P, we write f|P' to denote the function f' : P' → D,
defined as f'(p) = f(p) for p ∈ P'.

The following data type is used to describe operations: D = M_c × Φ × {“read”, “write”} × 2^P ×
(P → A) × Bool and we let O = OID → D_⊥. Given an operation descriptor, selectors for the
components are msg, cnd, sel, dlv, ack, and rsp.
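
To make the notation concrete, here is a minimal Python sketch of an operation descriptor and of the restriction f|P'. The class name `OpDescriptor`, the helpers `restrict` and `new_descriptor`, and the encoding of ⊥ as `None` are assumptions of this sketch rather than definitions taken from the specification.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, Optional, Set


def restrict(f: Dict[str, Any], subset: Set[str]) -> Dict[str, Any]:
    """f|P': the function f restricted to the processes in subset."""
    return {p: f[p] for p in subset}


@dataclass
class OpDescriptor:
    msg: Any                                   # message describing the request
    cnd: Callable[[Dict[str, Any]], Any]       # condenser function
    sel: str                                   # "read" or "write"
    dlv: Set[str] = field(default_factory=set)                   # delivered to
    ack: Dict[str, Optional[Any]] = field(default_factory=dict)  # None encodes ⊥
    rsp: bool = False                          # response already given?


def new_descriptor(msg: Any, cnd: Callable[[Dict[str, Any]], Any],
                   sel: str, processes: Iterable[str]) -> OpDescriptor:
    """Default descriptor: nothing delivered, no acknowledgment, no response."""
    if sel not in ("read", "write"):
        raise ValueError("sel must be 'read' or 'write'")
    return OpDescriptor(msg, cnd, sel, set(), {p: None for p in processes}, False)


d = new_descriptor("read-register", len, "read", {"p", "q"})
```

The six fields mirror the selectors msg, cnd, sel, dlv, ack and rsp in the order in which they appear in the product D.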

The code of the DC specification is given in Figure 6-1.

Next we provide remarks and an informal description of this code. We start with the derived
variables.

A configuration c ∈ Att is said to be attempted. For an attempted configuration c there exists at
least one process p that has executed action NEWCONF(c)_p and thus we have that p ∈ attempted[c.id];
when this holds we say that c is attempted at p or that p has attempted c. A configuration c ∈ TotAtt
is said to be totally attempted. A totally attempted configuration is a configuration that is attempted
at all members of the configuration.

A configuration c ∈ Est is said to be established. For an established configuration c there exists at
least one process p that has executed action NEWSTATE(s)_p and thus we have that p ∈ state-dlv[c.id];
when this holds we say that c is established at p or that p has established c. A configuration
c ∈ TotEst is said to be totally established. A totally established configuration is a configuration that
is established at all members of the configuration.

A dead configuration c is a configuration for which a member process p went on to newer
configurations, that is, it executed action NEWCONF(c')_p with c'.id > c.id, before receiving the notification,
that is the NEWCONF(c)_p event, for configuration c.
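
The derived variables just described can be computed from the state as in the following Python sketch. The `Config` class (with `members` playing the role of c.set), the dictionary encoding of the state variables, and the restriction of dead to created configurations are simplifying assumptions of this sketch; in the specification dead ranges over all configurations.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional, Set


@dataclass(frozen=True)
class Config:
    id: int
    members: FrozenSet[str]   # plays the role of c.set


def is_dead(c: Config, attempted: Dict[int, Set[str]],
            cur_cid: Dict[str, Optional[int]]) -> bool:
    """Some member moved past c.id without ever attempting c."""
    tried = attempted.get(c.id, set())
    return any(cur_cid.get(p) is not None and cur_cid[p] > c.id and p not in tried
               for p in c.members)


def derived(created: Set[Config], attempted: Dict[int, Set[str]],
            state_dlv: Dict[int, Set[str]],
            cur_cid: Dict[str, Optional[int]]) -> Dict[str, Set[Config]]:
    return {
        "Att": {c for c in created if attempted.get(c.id)},
        "TotAtt": {c for c in created if c.members <= attempted.get(c.id, set())},
        "Est": {c for c in created if state_dlv.get(c.id)},
        "TotEst": {c for c in created if c.members <= state_dlv.get(c.id, set())},
        "dead": {c for c in created if is_dead(c, attempted, cur_cid)},
    }


pq = frozenset({"p", "q"})
c0, c1, c2 = Config(0, pq), Config(1, pq), Config(2, pq)
sets = derived(
    created={c0, c1, c2},
    attempted={0: {"p", "q"}, 1: {"p"}, 2: {"p", "q"}},   # q skipped c1
    state_dlv={0: {"p", "q"}, 2: {"p"}},
    cur_cid={"p": 2, "q": 2},
)
```

In this example c1 is dead because q reached configuration 2 without attempting c1, while c2 is totally attempted but only established at p.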

Now we comment on the transitions.

Action CREATECONF(c) creates a new configuration c. The first precondition simply requires this
new configuration to have a brand new identifier. The second precondition of this action is the key
to our specification. It states that when a configuration c is created it must either be already dead or
for any other configuration w such that there are no intervening totally established configurations,
the earlier configuration (i.e., the one with smaller identifier) has at least one read quorum and one
write quorum that are subsets of the membership set of the later configuration (i.e., the one with
bigger identifier).
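
The following Python sketch evaluates this precondition under some stated assumptions: configurations are modeled by a hypothetical `Config` class carrying their read and write quorums, the sets TotEst and dead are passed in already computed, and whether c itself is dead is given by the flag `c_dead`. As in Figure 6-1, pairs involving a dead configuration w are exempt from the check.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set, Tuple


@dataclass(frozen=True)
class Config:
    id: int
    members: FrozenSet[str]                  # c.set
    rqrms: Tuple[FrozenSet[str], ...]        # read quorums
    wqrms: Tuple[FrozenSet[str], ...]        # write quorums


def has_rw_quorums_in(earlier: Config, later: Config) -> bool:
    """Some read quorum and some write quorum of earlier lie inside later.set."""
    return (any(r <= later.members for r in earlier.rqrms)
            and any(w <= later.members for w in earlier.wqrms))


def createconf_enabled(c: Config, created: Set[Config], tot_est: Set[Config],
                       dead: Set[Config], c_dead: bool = False) -> bool:
    if any(w.id == c.id for w in created):
        return False                          # identifier must be new
    if c_dead:
        return True                           # no constraint on dead configurations
    for w in created:
        if w in dead:
            continue
        lo, hi = (w, c) if w.id < c.id else (c, w)
        if any(lo.id < x.id < hi.id for x in tot_est):
            continue                          # separated by a totally established one
        if not has_rw_quorums_in(lo, hi):
            return False
    return True


def fs(*names: str) -> FrozenSet[str]:
    return frozenset(names)


c0 = Config(0, fs("a", "b", "c"), (fs("a", "b"),), (fs("b", "c"),))
good = Config(1, fs("a", "b", "c", "d"), (fs("d"),), (fs("d"),))
bad = Config(1, fs("a", "d"), (fs("a"),), (fs("d"),))
far = Config(2, fs("d", "e"), (fs("d"),), (fs("e"),))
```

With these values, `good` may be created after `c0`, `bad` may not (it contains no read quorum of `c0`), and `far` may be created once `good` is totally established, even though it shares no quorum with `c0`.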

Action NEWCONF(c)_p delivers a created configuration c to the client process p. The precondition
of this action makes sure that configurations are delivered in order of configuration identifiers. We
notice that because of this precondition, when a configuration c is dead because a process q went on
to newer configurations, we have that process q can no longer execute action NEWCONF(c)_q.

Once a configuration c has been delivered to a client process p, the client process p is supposed
to submit its current state s and a condenser function φ, by means of action SUBMIT-STATE(s, φ)_p. Once
all the processes have submitted their current states, the condenser function φ is used to compute
the starting state of configuration c for process p. The code of this action just memorizes the state
s and the condenser function φ for the current configuration of process p.


DC

Signature:
    Internal: CREATECONF(c), c ∈ C
    Output:   NEWCONF(c)_p, c ∈ C, p ∈ c.set
              NEWSTATE(s)_p, s ∈ S
              RESPOND(a, i)_p, a ∈ A, i ∈ OID_p, p ∈ P
              DELIVER(m, i)_p, m ∈ M_c, i ∈ OID, p ∈ P
    Input:    SUBMIT(m, φ, b, i)_p, m ∈ M_c, φ ∈ Φ,
                  b ∈ {“read”, “write”}, p ∈ P, i ∈ OID_p
              ACKDLVR(a, i)_p, a ∈ A, i ∈ OID, p ∈ P
              SUBMIT-STATE(s, φ)_p, s ∈ S, φ ∈ Φ

State:
    created ∈ 2^C, init {c_0}
    for each p ∈ P:
        cur-cid[p] ∈ G_⊥, init g_0 if p ∈ P_0, ⊥ else
    for each g ∈ G:
        attempted[g] ∈ 2^P, init P_0 if g = g_0, {} else
    for each g ∈ G:
        got-state[g] ∈ P → S_⊥, init everywhere ⊥
        condenser[g] ∈ P → Φ_⊥, init everywhere ⊥
        state-dlv[g] ∈ 2^P, init P_0 if g = g_0, {} else
        pending[g] ∈ O, init everywhere ⊥

Derived variables:
    Att ∈ 2^C, defined as {c ∈ created | attempted[c.id] ≠ ∅}
    TotAtt ∈ 2^C, defined as {c ∈ created | c.set ⊆ attempted[c.id]}
    Est ∈ 2^C, defined as {c ∈ created | state-dlv[c.id] ≠ ∅}
    TotEst ∈ 2^C, defined as {c ∈ created | c.set ⊆ state-dlv[c.id]}
    dead ∈ 2^C, defined as {c ∈ C | ∃p ∈ c.set : cur-cid[p] > c.id and p ∉ attempted[c.id]}

Actions:

    internal CREATECONF(c)
        Pre: ∀w ∈ created : c.id ≠ w.id
             if c ∉ dead then
                 ∀w ∈ created, w.id < c.id:
                     w ∈ dead or
                     (∃x ∈ TotEst : w.id < x.id < c.id) or
                     (∃R ∈ w.rqrms, ∃W ∈ w.wqrms : R ∪ W ⊆ c.set)
                 ∀w ∈ created, w.id > c.id:
                     w ∈ dead or
                     (∃x ∈ TotEst : c.id < x.id < w.id) or
                     (∃R ∈ c.rqrms, ∃W ∈ c.wqrms : R ∪ W ⊆ w.set)
        Eff: created := created ∪ {c}

    output NEWCONF(c)_p, p ∈ c.set
        Pre: c ∈ created
             c.id > cur-cid[p]
        Eff: cur-cid[p] := c.id
             attempted[c.id] := attempted[c.id] ∪ {p}

    input SUBMIT-STATE(s, φ)_p
        Eff: if cur-cid[p] ≠ ⊥ and got-state[cur-cid[p]](p) = ⊥ then
                 got-state[cur-cid[p]](p) := s
                 condenser[cur-cid[p]](p) := φ

    output NEWSTATE(s)_p choose c
        Pre: c.id = cur-cid[p]
             c ∈ created
             ∀q ∈ c.set : got-state[c.id](q) ≠ ⊥
             let f = condenser[c.id](p)|c.set
             s = f(got-state[c.id])
             p ∉ state-dlv[c.id]
        Eff: state-dlv[c.id] := state-dlv[c.id] ∪ {p}

    input SUBMIT(m, φ, b, i)_p
        Eff: if cur-cid[p] ≠ ⊥ then
                 pending[cur-cid[p]](i) := (m, φ, b, ∅, λ(x) : x → ⊥, false)

    output DELIVER(m, i)_p choose g
        Pre: g = cur-cid[p]
             p ∉ pending[g](i).dlv
             pending[g](i).msg = m
        Eff: pending[g](i).dlv := pending[g](i).dlv ∪ {p}

    input ACKDLVR(a, i)_p
        Eff: if cur-cid[p] ≠ ⊥ and pending[cur-cid[p]](i).ack(p) ≠ ⊥ then
                 pending[cur-cid[p]](i).ack(p) := a

    output RESPOND(r, i)_p choose c, Q
        Pre: c.id = cur-cid[p]
             c ∈ created
             i ∈ OID_p
             pending[c.id](i).rsp = false
             if pending[c.id](i).sel = “read”
                 then Q ∈ c.rqrms
                 else Q ∈ c.wqrms
             let f = pending[c.id](i).ack
             ∀q ∈ Q : f(q) ≠ ⊥
             r = pending[c.id](i).cnd(f|Q)
        Eff: pending[c.id](i).rsp := true

Figure 6-1: The DC specification

Action NEWSTATE(s)_p computes the starting state for a configuration c. The precondition of this
action requires that all processes q in the membership of configuration c have submitted their state
for configuration c. The starting state s of configuration c for process p is then computed by applying
the condenser function that process p has submitted to the service with action SUBMIT-STATE(s, φ)_p.
Variable state-dlv[c.id] records the fact that p has received the starting state for configuration c.

We remark that for a dead configuration c there is at least one process q that does not execute
action NEWCONF(c)_q and thus does not submit its state for c with action SUBMIT-STATE(s, φ)_q. This implies
that action NEWSTATE(s)_p cannot be executed for any process p. This is why such configurations are
called “dead”.

The remaining actions are used to handle the requests of clients. We refer to the process of
handling such a request, which involves the participation of a quorum of processes, as an “operation”.
To request the execution of an operation a client process p uses action SUBMIT(m, φ, b, i)_p. The
parameters of this action are as follows: m is a message describing the operation that p needs to perform
(e.g., read a register, write a register); φ is a condenser function to be used to compute a response
value for p when a quorum of processes have provided acknowledgment values to p's message m; b is
just a selector indicating whether to wait for acknowledgment values from a write quorum or from a
read quorum; i is an operation identifier needed to distinguish operations (every requested operation
has a unique operation identifier). We say “operation i” to indicate the operation requested with
action SUBMIT(m, φ, b, i)_p. For configuration c and operation i, the variable pending[c.id](i) contains
an operation descriptor; the code of action SUBMIT(m, φ, b, i)_p sets this operation descriptor to a
default value.

We now provide an explanation for each component of an operation descriptor. Let d be an
operation descriptor for operation i requested by p in configuration c. d.msg is the message that
describes the request of p; such a message will be delivered to all members of the configuration
c. d.cnd is the condenser function that will be used to compute the response for the operation
once a quorum of processes has provided acknowledgment values. d.sel is the selector that specifies
whether to use a read or a write quorum. d.dlv is the set of processes to which the message d.msg
has been delivered; initially this is set to an empty set by action SUBMIT(m, φ, b, i)_p. d.ack contains
the acknowledgment values received; initially this is a vector of ⊥ values. Finally, d.rsp is a flag
indicating whether or not the client p, which requested the operation, has received a response for
the operation.

Action DELIVER(m, i)_p delivers the message m of operation i to process p. The code of this action
updates the operation descriptor d for operation i by adding process p to the set d.dlv.

Processes that receive the message m for an operation i are supposed to provide an
acknowledgment value a with action ACKDLVR(a, i)_p. The code of this action records the acknowledgment value a
of process p into the vector d.ack, where d is the operation descriptor for operation i.

Finally, action RESPOND(r, i)_p provides a response r to process p for the operation i previously
submitted by p. The precondition of this action requires that a quorum Q has provided acknowledgment
values (the type of the quorum depends on the selector provided at the time of the operation
submission). Then the value r is computed by applying the condenser function provided by p at the
time of the submission, to the acknowledgment values of processes in Q. At this point the operation
has been serviced and the rsp component is set to true.


6.3 Invariants of DC 


In this section we provide invariants of DC. These invariants are used to prove the correctness of the
application that we build on top of DC.


Invariant 6.3.1 In any reachable state of DC, the following is true. Let c_1, c_2 ∈ created \ dead, with
c_1.id < c_2.id. Then either there exists w ∈ TotEst with c_1.id < w.id < c_2.id, or else there exist a
read quorum R and a write quorum W of c_1 such that R ∪ W ⊆ c_2.set.


Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. In the initial state the invariant is vacuously true because there
are no two configurations c_1, c_2 ∈ created such that c_1.id < c_2.id.

For the inductive step assume that the invariant is true in a state s. We need to prove that
the invariant is true in s' for any step (s, π, s'). The only action that we need to worry about is
CREATECONF(c), where c = c_1 or c = c_2, because it creates a new configuration (otherwise the invariant
is true in s' by the inductive hypothesis). So assume that π = CREATECONF(c). The invariant follows
easily from the precondition of π. □


We remark that the intersection property stated in the above invariant may not hold for dead
configurations. However, in a dead configuration it is not possible to make progress because for such
a configuration there is at least one process that will not participate and thus the configuration will
never become established.

The need for considering dead configurations comes from the implementation of the specification
that we provide. It is possible to give a stronger version of DC by requiring that the intersection
property in the precondition of action CREATECONF holds also for dead configurations. We do not
know if this stronger version is implementable.


Invariant 6.3.2 In any reachable state of DC, the following is true. If p ∈ attempted[g] then
cur-cid[p] ≥ g.

Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. In the initial state we have that attempted[g] is P_0 if g = g_0 and
∅ otherwise. Moreover for each p ∈ P_0 we have that cur-cid[p] = g_0. Hence the invariant is true.

For the inductive step fix g and p and assume that the invariant is true in a state s. We need to
prove that the invariant is true in s' for any step (s, π, s'). The actions that can make the invariant
false are those that either put p into attempted[g] or modify cur-cid[p]. Hence we need only to
worry about action NEWCONF(c)_p. So assume that π = NEWCONF(c)_p. The invariant follows easily from the
precondition of π. □


Invariant 6.3.3 In any reachable state of DC, the following is true. If c ∈ created \ dead, w ∈ TotAtt,
and w.id > c.id, then there exists R ∈ c.rqrms such that for all p ∈ R, cur-cid[p] > c.id.


Proof: Consider any particular reachable state. Assume that c ∈ created \ dead, w ∈ TotAtt, and
w.id > c.id. Then let y be the configuration in TotAtt having the smallest identifier strictly greater
than c.id. Note that y ∉ dead, since a dead configuration cannot be totally attempted. Then there
is no x ∈ TotEst with c.id < x.id < y.id, since every totally established configuration is also totally
attempted. Then Invariant 6.3.1 implies that for some R ∈ c.rqrms,
R ⊆ y.set. Let p be any element of R. Since y ∈ TotAtt we have that p ∈ attempted[y.id]. By
Invariant 6.3.2 we have cur-cid[p] ≥ y.id. Since y.id > c.id we have cur-cid[p] > c.id. □


6.4 An implementation of DC 


In this section we provide an algorithm that implements, in the sense of trace inclusion, the DC
specification. The algorithm is built on top of the CS service and uses ideas from [88].


6.4.1 Overview 


The implementation of DC that we provide in this section is similar to the implementation of Dvs 
provided in Chapter 5. However, there is a key difference in the implementation of DC compared to 
that of Dvs. This difference provides new insights for the Dvs specification and implementation, as 
we explain below. 

The DVS specification requires a global intersection property which is the following: given two
primary views w and v with no intervening totally established view¹, we must have that w.set ∩
v.set ≠ ∅. The DVS implementation, when delivering a new view v, checks a stronger property locally
to the processes, which requires that |v.set ∩ w.set| > |w.set|/2 for all the views w, w.id < v.id,
known by the process performing the check.

¹“Established” views are called “registered” views in Chapter 5. This is due to the fact that the DVS specification
requires client processes to “register” a new view when they have obtained enough information to begin regular
computation; in the DC service this is handled by the service itself. However the meaning of “established” is the same
as that of “registered”, that is, a client process has got the information needed to proceed with regular computation.
We use a different name just to emphasize the fact that in DVS clients need to “register” views while in DC configurations
become “established” under the control of the service.

The DC specification requires a global intersection property which is the following: given two
primary configurations, neither of which is dead, with no intervening totally established
configurations, there must exist a read and a write quorum of the configuration with the smaller identifier
which are included in the membership set of the configuration with the bigger identifier. The DC
implementation checks the same property locally to each process. The intuitive reason why by checking
locally the same property we can prove it also globally is that we exclude dead configurations. This
suggests that also for DVS we can prove the stronger intersection property (the one checked locally)
or we can use a weaker local check (the intersection required globally) if we do exclude dead views.
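
For comparison, the stronger local check used by the DVS implementation (a new view must contain a strict majority of every earlier view known to the process) can be sketched as below. The function name `dvs_local_check` and the representation of views as plain sets of process names are assumptions of this sketch.

```python
from typing import FrozenSet, Iterable


def dvs_local_check(v_set: FrozenSet[str],
                    known_views: Iterable[FrozenSet[str]]) -> bool:
    """True if v_set contains a strict majority of every known earlier view,
    i.e. |v.set ∩ w.set| > |w.set| / 2 for each w."""
    return all(2 * len(v_set & w) > len(w) for w in known_views)


w = frozenset({"a", "b", "c", "d", "e"})
v1 = frozenset({"a", "b", "c"})
v2 = frozenset({"c", "d", "e", "f"})
half = frozenset({"a", "b"})
```

Both `v1` and `v2` pass the check against `w`, and they necessarily share a member of `w`, which is why the majority check is stronger than the plain non-empty intersection required by the specification. The set `half` fails the check against a four-member view such as {a, b, c, d}, since exactly half is not a strict majority.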


The DC implementation is built upon a static configuration service, called CS. This service is basically
the VS service adapted to handle configurations (see Section 4.2).

The automaton CS-TO-DC_p is given in Figures 6-2 and 6-3. The overall system DC-IMPL is defined
as the composition of CS and CS-TO-DC_p, for p ∈ P.

Automaton CS-TO-DC_p uses special messages, tagged either with “info”, used to send
information about the active and ambiguous configurations, or with “got-state”, used to send the state
submitted by a process to all the members of the configuration. The former information is needed
to check the intersection property that new primary configurations have to satisfy according to the
DC specification. The latter information is needed in order to compute the starting state for a new
configuration. Thus, we use M = M_c ∪ ({“info”} × V × 2^V) ∪ {“got-state”}, where M_c is the set of
all client messages and M is the universe of all messages.

The major problem is that the DC specification requires a global intersection property (i.e., a
property that can be checked only by someone that knows the entire system state), while each single
process has only local knowledge of the system. So, in order to guarantee that a new configuration
satisfies the requirement of DC, each single process needs information from the other processes that
are members of the configuration.

Informally, the filtering of configurations works as follows. Each process keeps track of the latest
totally established configuration, called the “active” configuration, recorded in variable act, and
a set of “ambiguous” configurations, recorded in variable amb, which are those configurations
that were notified after the active configuration but have not yet become established. We define
use = {act} ∪ amb. When CS provides a new configuration c to process p by means of action
CS-NEWCONF(c)_p, process p sends out an “info” message containing its current act_p and amb_p values
to all other processes in the new configuration, using the CS service, and waits to receive the
corresponding “info” messages for configuration c from all the other members of c. After receiving
this information (and updating its own act_p and amb_p accordingly), process p checks whether c has
the required intersection property with each configuration in the use_p set. If so, configuration c
is given in output to the client at p by means of action NEWCONF(c)_p.
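
As an illustration of the local check just described, the following Python fragment is a minimal
sketch, not the automaton itself. The class Config and the functions contains_quorums_of and
may_attempt are hypothetical names introduced here; the sketch assumes that a configuration is a
membership set together with a set of read quorums and a set of write quorums, and that the check
requires the new configuration to contain at least one read quorum and one write quorum of every
configuration in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Hypothetical model of a configuration (names are assumptions)."""
    cid: int              # configuration identifier (c.id)
    members: frozenset    # membership set (c.set)
    rqrms: frozenset      # read quorums, each a frozenset of processes
    wqrms: frozenset      # write quorums, each a frozenset of processes


def contains_quorums_of(new: Config, old: Config) -> bool:
    """True if new.members includes some read and some write quorum of old."""
    has_read = any(r <= new.members for r in old.rqrms)
    has_write = any(w <= new.members for w in old.wqrms)
    return has_read and has_write


def may_attempt(new: Config, act: Config, amb: set) -> bool:
    """Check the intersection property against use = {act} U amb."""
    use = {act} | set(amb)
    return all(contains_quorums_of(new, w) for w in use)


def fs(*xs):
    """Shorthand for building frozensets in the example below."""
    return frozenset(xs)


c0 = Config(0, fs(1, 2, 3), fs(fs(1, 2)), fs(fs(2, 3)))
c1 = Config(1, fs(1, 2, 3, 4), fs(fs(1, 4)), fs(fs(2, 4)))
c2 = Config(2, fs(1, 4), fs(fs(1)), fs(fs(4)))
```

In this example may_attempt(c1, c0, set()) holds because c1 contains both quorums of c0, while
may_attempt(c2, c0, set()) fails because c2 contains no read quorum of c0.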

When a new primary configuration c has been given in output to process p by means of action
NEWCONF(c)_p, the client at p submits its current state together with a condenser function to be used
to compute the starting state when all other members have submitted their state (such a condenser
function depends on the application). Clearly the state of p is needed by other processes in the
configuration while p needs the state of the other processes. Hence when a SUBMIT-STATE(s, φ)_p is
executed at p, the state s submitted by process p is sent out with a got-state message to all other
members of the configuration, using the CS service. Upon receiving the state of all other processes,
CS-TO-DC_p uses the condenser function φ provided by the client at p in order to compute the starting
state to be output, by means of action NEWSTATE(s)_p, to the client at p.
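
The state exchange can be pictured with the following Python sketch. It is only an illustration
under stated assumptions: the function start_state and the example condenser newest are hypothetical
names, states are modeled as (version, value) pairs, and a missing dictionary entry plays the role
of the undefined value ⊥.

```python
def start_state(state_got, members, condenser):
    """Return the condensed starting state, or None while some state is missing.

    state_got maps a process to the state it submitted; a process with no
    entry has not yet been heard from (this models the undefined value).
    """
    if any(q not in state_got for q in members):
        return None  # the precondition of NEWSTATE is not yet satisfied
    # Restrict the collected states to the members of the configuration.
    return condenser({q: state_got[q] for q in members})


def newest(states):
    """Example condenser (an assumption): keep the state with the highest version."""
    return max(states.values(), key=lambda st: st[0])


members = {1, 2, 3}
partial = {1: (3, "a"), 2: (5, "b")}
complete = {1: (3, "a"), 2: (5, "b"), 3: (4, "c")}
```

With these values start_state(partial, members, newest) returns None, and
start_state(complete, members, newest) returns the state with version 5.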

The communication mechanism of DC is quite different from that offered by CS: DC offers a
broadcast/convergecast communication mechanism, while CS offers point-to-point communication
channels. However it is not difficult to implement the former by using the communication service of
the latter. The relevant code is in Figure 6-3. When a message m is submitted by means of action
SUBMIT(m, φ, b, i)_p, together with a condenser function φ, a quorum-type b ∈ {“read”, “write”}, and
an operation identifier i, message m, tagged with i, is sent to all the members of the configuration
using the CS service, and an operation descriptor for i is initialized. When another process q
receives the message m of operation i, it delivers it to its client by means of action
DELIVER(m, i)_q. At this point the client at q is expected to supply an acknowledgment value a for
operation i by means of action ACKDLVR(a, i)_q. This value is sent back to p using the CS service.
When p receives this value it updates the descriptor of operation i with the value obtained from q.
If there are acknowledgment values from a quorum of the type specified by b, then the condenser
function φ is applied to the acknowledgment values of this quorum in order to compute a response r
to the message m submitted by p. Such a response is given to p by means of action RESPOND(r, i)_p.
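
The convergecast step can be sketched in Python as follows. This is an illustration only: the
function try_respond is a hypothetical name, quorums are given as a list of sets of processes of the
selected type, and a process without an entry in acks has not acknowledged yet.

```python
def try_respond(acks, quorums, condenser):
    """Return the condensed response once some quorum has fully acknowledged.

    acks maps a process to its acknowledgment value; quorums lists the quorums
    of the selected type ("read" or "write") for the current configuration.
    Returns None if no quorum has acknowledged completely.
    """
    for quorum in quorums:
        if all(q in acks for q in quorum):
            # Apply the condenser to the acknowledgments restricted to the quorum.
            return condenser({q: acks[q] for q in quorum})
    return None


read_quorums = [frozenset({1, 2}), frozenset({2, 3})]
largest = lambda f: max(f.values())  # example condenser (an assumption)
```

For instance, with acknowledgments only from process 1 no response is produced, while
acknowledgments from processes 2 and 3 complete the second quorum and enable a response.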

There are five derived variables for DC-IMPL. Four of them are analogous to those of DVS,
indicating the attempted, totally attempted, established, and totally established configurations,
respectively. A fifth one, use_p, keeps track of the set of configurations used to check the
intersection property before attempting a new configuration.


6.4.2 Invariants of DC-IMPL 


This section contains invariants of DC-IMPL needed for the proof that DC-IMPL implements DC in
Section 6.4.3. The proofs of these invariants are similar to those of the corresponding invariants of
the DVS implementation (see Section 5.2.2). This is because the implementation of DC is similar to
the implementation of DVS, so many basic invariants are the same. For each of these basic invariants
we provide an operational proof (i.e., a proof that does not rely exclusively on the state previous
to the one for which the invariant states a property).

CS-TO-DC_p


Signature:

Input:    CS-NEWCONF(c)_p, c ∈ C, p ∈ c.set
          CS-GPRCV(m)_{q,p}, m ∈ M, q ∈ P
          CS-SAFE(m)_{q,p}, m ∈ M, q ∈ P
          SUBMIT-STATE(s, φ)_p, s ∈ S, φ ∈ Φ
          SUBMIT(m, φ, b, i)_p, m ∈ M_c, φ ∈ Φ,
              b ∈ {“read”, “write”}, p ∈ P, i ∈ OID_p
          ACKDLVR(a, i)_p, a ∈ A, i ∈ OID, p ∈ P

Internal: GARBAGE-COLLECT(c)_p, c ∈ C

Output:   CS-GPSND(m)_p, m ∈ M
          DC-NEWCONF(c)_p, c ∈ C, p ∈ c.set
          DC-NEWSTATE(s)_p, s ∈ S
          DELIVER(m, i)_p, m ∈ M_c, i ∈ OID, p ∈ P
          RESPOND(a, i)_p, a ∈ A, i ∈ OID_p, p ∈ P

State:

cur ∈ C_⊥, init c_0 if p ∈ P_0, ⊥ else
client-cur ∈ C_⊥, init c_0 if p ∈ P_0, ⊥ else
act ∈ C, init c_0
amb ∈ 2^C, init ∅
attempted ∈ 2^C, init {c_0} if p ∈ P_0, ∅ else
for each g ∈ G, q ∈ P
    info-rcvd[q, g] ∈ (C × 2^C)_⊥, init ⊥
    rcvd-estb[q, g] ∈ (C × 2^C)_⊥, init ⊥
for each g ∈ G
    to-cs[g] ∈ seqof(M), init λ
    info-sent[g] ∈ (C × 2^C)_⊥, init ⊥
    dlv-queue[g] ∈ seqof(M), init λ
    cond[g] ∈ Φ_⊥, init ⊥
    pend[g] ∈ O, init ⊥
    msg-dlvd[g] : OID → {true, false}
    state-got[g] : P → S_⊥, init ⊥
    estb[g] ∈ bool, init true if p ∈ P_0 and g = g_0, false else

Derived variables:

use ∈ 2^C, defined as use = {act} ∪ amb
Att ∈ 2^C, defined as Att = {c ∈ created | (∃p ∈ c.set) c ∈ attempted_p};
Est ∈ 2^C, defined as Est = {c ∈ created | (∃p ∈ c.set) estb[c.id]_p = true};
TotAtt ∈ 2^C, defined as TotAtt = {c ∈ created | (∀p ∈ c.set) c ∈ attempted_p};
TotEst ∈ 2^C, defined as TotEst = {c ∈ created | (∀p ∈ c.set) estb[c.id]_p = true}.

Transitions:

input CS-NEWCONF(c)_p
    Eff: cur := c
         append (“info”, act, amb) to to-cs[cur.id]
         info-sent[cur.id] := (act, amb)

input CS-GPRCV((“info”, c, C))_{q,p}
    Eff: info-rcvd[q, cur.id] := (c, C)
         if c.id > act.id then act := c
         amb := {w ∈ amb ∪ C | w.id > act.id}

input CS-SAFE((“info”, c, C))_{q,p}
    Eff: none

output NEWCONF(c)_p
    Pre: c = cur
         c.id > client-cur.id
         ∀q ∈ c.set, q ≠ p: info-rcvd[q, c.id] ≠ ⊥
         ∀w ∈ use: ∃R ∈ w.rqrms, R ⊆ c.set
         ∀w ∈ use: ∃W ∈ w.wqrms, W ⊆ c.set
    Eff: amb := amb ∪ {c}
         attempted := attempted ∪ {c}
         client-cur := c

input SUBMIT-STATE(s, φ)_p
    Eff: g = client-cur.id
         if g ≠ ⊥ then
             state-got[g](p) := s
             cond[g] := φ
             append (“state-got”, s) to to-cs[g]

input CS-GPRCV((“state-got”, s))_{q,p}
    Eff: state-got[cur.id](q) := s

input CS-SAFE((“state-got”, s))_{q,p}
    Eff: none

output NEWSTATE(s)_p
    Pre: g = cur.id
         g ≠ ⊥
         ∀q ∈ c.set, state-got[g](q) ≠ ⊥
         s = cond[g](state-got[g]|cur.set)
         estb[g] = false
    Eff: estb[g] := true
         append “established” to to-cs[g]

input CS-GPRCV(“established”)_{q,p}
    Eff: rcvd-estb[q, cur.id] := true

input CS-SAFE(“established”)_{q,p}
    Eff: none

internal GARBAGE-COLLECT(c)_p
    Pre: ∀q ∈ c.set, rcvd-estb[q, c.id] = true
         c.id > act.id
    Eff: act := cur
         amb := {w ∈ amb | w.id > act.id}

output CS-GPSND(m)_p
    Pre: m is head of to-cs[cur.id]
    Eff: remove head of to-cs[cur.id]


Figure 6-2: CS-TO-DC_p
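
The bookkeeping of act and amb in Figure 6-2 can be mimicked by the small Python class below. It is
a sketch under simplifying assumptions, not a transcription of the automaton: configurations are
represented only by their integer identifiers, the class name ConfigFilter and its methods are
hypothetical, and garbage_collect advances act to the collected configuration, which is the reading
assumed for this sketch.

```python
class ConfigFilter:
    """Tracks the active configuration and the ambiguous ones (by identifier)."""

    def __init__(self, initial_id):
        self.act = initial_id   # latest totally established configuration
        self.amb = set()        # attempted or reported, not yet known established

    def use(self):
        """The set use = {act} U amb checked before attempting a configuration."""
        return {self.act} | self.amb

    def on_info(self, act_q, amb_q):
        """Effect of receiving ("info", act_q, amb_q) from another process."""
        if act_q > self.act:
            self.act = act_q
        # Keep only configurations newer than the (possibly updated) active one.
        self.amb = {w for w in self.amb | set(amb_q) if w > self.act}

    def on_attempt(self, cid):
        """Effect of NEWCONF on the ambiguous set."""
        self.amb.add(cid)

    def garbage_collect(self, cid):
        """Call only once cid is known to be established at all its members."""
        if cid > self.act:
            self.act = cid
            self.amb = {w for w in self.amb if w > self.act}


f = ConfigFilter(0)
f.on_attempt(1)
f.on_info(0, {2})
use_before = f.use()
f.on_info(2, set())
use_after = f.use()
```

After the first "info" message the use set contains configurations 0, 1 and 2; after learning that
configuration 2 is active elsewhere, older entries are dropped and only 2 remains.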


CS-TO-DC_p (transitions cont’d)

input SUBMIT(m, φ, b, i)_p
    Eff: g = client-cur.id
         if g ≠ ⊥ then
             pend[g](i) := (m, φ, b, ∅, (∀q : q ↦ ⊥), false)
             append (m, i) to to-cs[g]

input ACKDLVR(a, i)_p
    Eff: append (a, i) to to-cs[client-cur.id]

input CS-GPRCV((a, i))_{q,p}
    Eff: if i ∈ OID_p then
             pend[i].ack(q) := a

input CS-GPRCV((m, i))_{q,p}
    Eff: append (m, i) to dlv-queue[cur.id]

input CS-SAFE((a, i))_{q,p}
    Eff: none

input CS-SAFE((m, i))_{q,p}
    Eff: none

output DELIVER(m, i)_p
    Pre: (m, i) = head(dlv-queue[cur.id])
         msg-dlvd[cur.id](i) = false
    Eff: dlv-queue[cur.id] := tail(dlv-queue[cur.id])
         msg-dlvd[cur.id](i) := true

output RESPOND(r, i)_p
    choose Q
    Pre: g = cur.id
         g ≠ ⊥
         i ∈ OID_p
         pend[g](i).rsp = false
         if pend[g](i).sel = “read”
             then Q ∈ cur.rqrms
             else Q ∈ cur.wqrms
         let f = pend[g](i).ack
         ∀q ∈ Q: f(q) ≠ ⊥
         r = pend[g](i).cnd(f|Q)
    Eff: pend[g](i).rsp := true


Figure 6-3: CS-TO-DC_p (transitions cont’d)


Invariant 6.4.1 (DC-IMPL)
In any reachable state, if (“info”, x, X) ∈ to-cs[g]_p, or (“info”, x, X) ∈ pending[p, g], or
((“info”, x, X), p) ∈ queue[g], or info-rcvd[p, g]_q = (x, X), then info-sent[g]_p = (x, X) and
cur.id_p ≥ g.


Proof Sketch: This invariant is true because whenever process p puts the message (“info”, x, X)
into to-cs[g]_p, in action CS-NEWCONF(c)_p, where c.id = g, it sets info-sent[g]_p := (x, X).
Moreover at that moment it also sets cur_p := c. From that moment on, because configuration
identifiers provided by CS only increase, we have that cur.id_p ≥ g. Clearly this continues to be
true when the “info” message goes through pending[p, g] and queue[g], and finally gets to q and is
recorded into info-rcvd[p, g]_q. □


Invariant 6.4.2 (DC-IMPL)
In any reachable state, if info-sent[g]_p = (x, X), w ∈ attempted_p, and w.id < g, then either
w ∈ {x} ∪ X or w.id < x.id.


Proof Sketch: If process p sent an “info” message for configuration g and has also attempted a
previous configuration w, then either process p has already garbage-collected configuration w (if
a configuration with identifier bigger than w.id has been totally established), which implies that
w.id < x.id, or w is still in the use set of p, which implies that w ∈ {x} ∪ X. □


Invariant 6.4.3 (DC-IMPL)
In any reachable state:
1. act_p ∈ TotEst.
2. If info-sent[g]_p = (x, X) then x ∈ TotEst.
3. use_p ∩ TotEst ≠ ∅.


Proof Sketch: Variable act_p is initially a totally established configuration and is always updated
to a totally established configuration (see action GARBAGE-COLLECT). Hence Part 1 follows. Part 2
follows from the fact that if info-sent[g]_p = (x, X) then the value of x is the value of the
variable act_p at the time when info-sent[g]_p is written, and thus by Part 1 is totally
established. Part 3 follows from Part 1 and the definition of use_p. □


Invariant 6.4.4 (DC-IMPL)
In any reachable state:
1. If cur_p ≠ ⊥ and w ∈ use_p, then w.id ≤ cur.id_p.
2. If cur_p ≠ ⊥ and client-cur_p ≠ cur_p and w ∈ use_p, then w.id < cur.id_p.
3. If info-sent[g]_p = (x, X) and w ∈ {x} ∪ X then w.id < g.


Proof Sketch: The only action that adds a new configuration c to use_p is action NEWCONF(c)_p.
The precondition of this action requires that cur_p = c, which implies cur.id_p = c.id. The
conclusion follows from the property of CS that configuration identifiers are released in increasing
order. This proves Part 1.

Part 2 follows by observing that when cur_p ≠ ⊥ and client-cur_p ≠ cur_p a new configuration has
been provided by CS but it has not been attempted yet. This implies that the current configuration
cur_p cannot be in the use_p set. Combining this with the conclusion of Part 1, we have that for any
w ∈ use_p, w.id < cur.id_p.

Finally Part 3 follows from the fact that process p sends information for a new configuration which
has not been attempted yet. Once again, this implies that the current configuration cur_p cannot be
in the use_p set and we conclude as for Part 2. □


Invariant 6.4.5 (DC-IMPL)
In any reachable state, if info-rcvd[q, g]_p = (x, X) and w ∈ {x} ∪ X, then either w ∈ use_p or
w.id < act.id_p.


Proof Sketch: Since info-rcvd[q, g]_p = (x, X) we have that process p has received (x, X) from
process q. When receiving this information (action CS-GPRCV((“info”, x, X))_{q,p}) process p updates
its use_p set and act_p. The conclusion follows from the code of CS-GPRCV((“info”, x, X))_{q,p}. □


Invariant 6.4.6 (DC-IMPL)
In any reachable state, if c ∈ attempted_p and q ∈ c.set then cur.id_q ≥ c.id.


Proof Sketch: Since c ∈ attempted_p we have that there has been a step NEWCONF(c)_p. By the
precondition of this action, since q ∈ c.set we have that process p received information from q for
configuration c, that is, info-sent[c.id]_q = (x, X). By Invariant 6.4.1 we have that
cur.id_q ≥ c.id. □


The following invariant states a simple fact about DC-IMPL. This invariant does not have a
corresponding one in DVS. However, since the statement of this invariant is simple, we provide an
operational proof for this invariant as well.


Invariant 6.4.7 (DC-IMPL)
In any reachable state, if (m, i) ∈ dlv-queue[g]_q, then pend[g](i).msg_p = m, where p is such that
i ∈ OID_p.


Proof Sketch: This is true because if (m, i) ∈ dlv-queue[g]_q, we have that process q received the
message (m, i) from a process p, with i ∈ OID_p. Moreover process p sent the message (m, i) in
action SUBMIT(m, φ, b, i)_p and this action sets pend[g](i).msg_p := m. □


Next we provide the main invariants that are needed in the proof of the simulation relation. 

Invariant 6.4.8 states that if a configuration c has been attempted by a process p and its
membership contains a process q which has attempted a configuration w previous to c and there is no
totally established configuration between w and c then c contains a read and a write quorum of
w. Intuitively this is true because when p attempts c it must have received information from all
the members of c, thus also from q; since q attempted w and w has not been garbage collected,
because there are no totally established configurations in between w and c, process q includes w
in the information it sends to p for configuration c. Invariant 6.4.9 generalizes Invariant 6.4.8 by
claiming that the intersection property holds for any configurations w and c such that there is no
totally established configuration x with w.id < x.id < c.id.

Finally Invariant 6.4.10 provides an intersection property crucial to the simulation relation. This
last invariant is the one where dead configurations are needed.
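
To make the shape of the intersection property concrete, the following Python sketch checks it over
a finite set of attempted configurations. It is an illustration only: the names Config,
has_quorums_of and intersection_property are hypothetical, the set of dead configurations is taken
as an input rather than derived, and the totally established configurations are given by their
identifiers.

```python
from itertools import permutations
from typing import NamedTuple


class Config(NamedTuple):
    """Hypothetical model of a configuration (names are assumptions)."""
    cid: int
    members: frozenset
    rqrms: tuple   # read quorums, each a frozenset of processes
    wqrms: tuple   # write quorums, each a frozenset of processes


def has_quorums_of(c, w):
    """True if c.members contains some read and some write quorum of w."""
    return (any(r <= c.members for r in w.rqrms)
            and any(q <= c.members for q in w.wqrms))


def intersection_property(att, tot_est_ids, dead_ids=frozenset()):
    """Check every pair w, c in att with w.cid < c.cid, w not dead, and no
    totally established identifier strictly between w.cid and c.cid."""
    for w, c in permutations(att, 2):
        if w.cid >= c.cid or w.cid in dead_ids:
            continue
        if any(w.cid < x < c.cid for x in tot_est_ids):
            continue
        if not has_quorums_of(c, w):
            return False
    return True


F = frozenset
c0 = Config(0, F({1, 2, 3}), (F({1, 2}),), (F({2, 3}),))
c1 = Config(1, F({1, 2, 3, 4}), (F({4}),), (F({4}),))
c2 = Config(2, F({4, 5}), (F({4}),), (F({5}),))
att = [c0, c1, c2]
```

Here c2 contains no quorum of c0, so the property fails when only c0 is totally established; it
holds once c1 is totally established as well, because c1 then separates c0 from c2.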


Invariant 6.4.8 (DC-IMPL)
In any reachable state, suppose that c ∈ attempted_p, q ∈ c.set, w ∈ attempted_q, w.id < c.id, and
there is no x ∈ TotEst such that w.id < x.id < c.id. Then there exist R ∈ w.rqrms and W ∈ w.wqrms
such that R ∪ W ⊆ c.set.


Proof: By induction on the length of an execution. The base case consists of proving that the
invariant is true in the initial state. In the initial state, only c_0 is attempted, so the
hypotheses cannot be satisfied. Thus, the statement is vacuously true.

For the inductive step assume the invariant is true in state s. We need to prove that it is true in
s' for any possible step (s, π, s'). Fix c, w, p, and q, and assume that c ∈ s'.attempted_p,
q ∈ c.set, w ∈ s'.attempted_q, w.id < c.id, and there is no x ∈ s'.TotEst such that
w.id < x.id < c.id. Then also there is no x ∈ s.TotEst such that w.id < x.id < c.id. We consider
four cases:

1. c ∈ s.attempted_p and w ∈ s.attempted_q.

   Then by the inductive hypothesis we have that in state s there exist R ∈ w.rqrms and
   W ∈ w.wqrms such that R ∪ W ⊆ c.set. Clearly this remains true in s' too (it remains true
   forever).

2. c ∉ s.attempted_p and w ∉ s.attempted_q.

   This cannot happen because we cannot have both c and w becoming attempted in a single step.

3. c ∉ s.attempted_p and w ∈ s.attempted_q.

   Then π must be NEWCONF(c)_p. Since q ∈ c.set, by the precondition of π we have that
   s.info-rcvd[q, c.id]_p = (x, X) for some x and X. Then Invariant 6.4.1 implies that
   s.info-sent[c.id]_q = (x, X). Then, since w.id < c.id, Invariant 6.4.2 implies that either
   w ∈ {x} ∪ X or w.id < x.id. We claim that it must be w ∈ {x} ∪ X. Indeed if w.id < x.id, by
   Invariant 6.4.3 we have that x ∈ s.TotEst and by Invariant 6.4.4, Part 3 (used with w = x) we
   have x.id < c.id; thus we would have a totally established configuration x such that
   w.id < x.id < c.id. This contradicts the inductive hypothesis. So it must be w ∈ {x} ∪ X.

   By Invariant 6.4.5 we have that either w ∈ s.use_p or w.id < s.act.id_p. In the former case, by
   the precondition of π, we have the needed conclusion. In the latter case, we obtain a
   contradiction. Indeed by Invariant 6.4.3 we have s.act_p ∈ TotEst. Moreover by the precondition
   of π, s.cur_p cannot be ⊥ and s.cur.id_p > s.client-cur.id_p and, by definition,
   s.act_p ∈ s.use_p. Hence by Invariant 6.4.4, Part 2, we have s.act.id_p < s.cur.id_p = c.id.
   Thus we would have a totally established configuration act such that w.id < act.id < c.id. This
   contradicts the inductive hypothesis.

4. c ∈ s.attempted_p and w ∉ s.attempted_q.

   Then π must be NEWCONF(w)_q. But this cannot happen. Indeed since c ∈ s.attempted_p and
   q ∈ c.set, Invariant 6.4.6 implies that s.cur.id_q ≥ c.id. Since c.id > w.id, we have
   s.cur.id_q > w.id. But the precondition of action π requires s.cur.id_q = w.id, so π is not
   enabled in s.


Invariant 6.4.9 (DC-IMPL)
In any reachable state, suppose that c ∈ Att, and w ∈ TotEst, w.id < c.id, and there is no
x ∈ TotEst such that w.id < x.id < c.id. Then there exist R ∈ w.rqrms and W ∈ w.wqrms such that
R ∪ W ⊆ c.set.


Proof: By induction on the length of an execution. The base case consists of proving that the
invariant is true in the initial state. In the initial state, only c_0 is attempted, so the
hypotheses cannot be satisfied. Thus, the statement is vacuously true.

For the inductive step assume the invariant is true in state s. We need to prove that it is true in
s' for any possible step (s, π, s'). Fix c and w, and assume that c ∈ s'.Att, w ∈ s'.TotEst,
w.id < c.id, and there is no x ∈ s'.TotEst such that w.id < x.id < c.id. We consider four cases:

1. c ∈ s.Att and w ∈ s.TotEst.

   The invariant follows from the inductive hypothesis.

2. c ∉ s.Att and w ∉ s.TotEst.

   This cannot happen because we cannot have both c becoming attempted and w becoming totally
   established in a single step.

3. c ∉ s.Att and w ∈ s.TotEst.

   Then π must be NEWCONF(c)_p for some p. The precondition of π implies that, for any
   configuration y ∈ s.use_p, there exist R ∈ y.rqrms and W ∈ y.wqrms such that R ∪ W ⊆ c.set.
   Hence to prove the claim it is enough to prove that w ∈ s.use_p. We proceed by contradiction
   assuming that w ∉ s.use_p.

   By Invariant 6.4.3, Part 3, s.use_p ∩ s.TotEst ≠ ∅. Let m be the configuration in
   s.use_p ∩ s.TotEst having the biggest identifier. We know that m ≠ w because w ∉ s.use_p. Also,
   m ≠ c, because m ∈ s.TotEst and c ∉ s.TotEst. It follows that m.id ≠ w.id and m.id ≠ c.id.

   We claim that m.id < w.id. We have already shown that m.id ≠ w.id. Suppose for the sake
   of contradiction that m.id > w.id. From the precondition of action π we have that s.cur_p = c
   and hence s.cur_p ≠ ⊥. Also from the precondition of π we have that
   s.client-cur.id_p < s.cur.id_p. Since m ∈ s.use_p, Invariant 6.4.4, Part 2, implies that
   m.id < s.cur.id_p and since s.cur_p = c we have that m.id < c.id. So w.id < m.id < c.id. Since
   m ∈ s'.TotEst, this contradicts the hypothesis of the inductive step. Therefore, m.id < w.id.

   Let n be the configuration in s.TotEst that has the smallest identifier strictly greater than
   that of m. Remember that w ∈ s'.TotEst and since π = NEWCONF(c)_p we have that w ∈ s.TotEst and
   thus such an n exists and satisfies m.id < n.id ≤ w.id. Since m ∈ s.use_p, the precondition of
   π implies that there exist R ∈ m.rqrms and W ∈ m.wqrms such that R ∪ W ⊆ c.set. Since by
   inductive hypothesis the invariant is true in state s, we have that there exist R' ∈ m.rqrms and
   W' ∈ m.wqrms such that R' ∪ W' ⊆ n.set. By the properties of quorums we have that there
   exists one process q ∈ (R ∪ W) ∩ (R' ∪ W') and thus we have that q ∈ n.set ∩ c.set. By the
   precondition of π, s.info-rcvd[q, c.id]_p = (x, X) for some x, X. Then Invariant 6.4.1 implies
   that s.info-sent[c.id]_q = (x, X) and Invariant 6.4.3 says that x ∈ s.TotEst. Then Invariant
   6.4.4, Part 3 (used with w = x), implies that x.id < c.id. Since n ∈ s.TotEst, we have that
   n ∈ s.attempted_q. Then Invariant 6.4.2 (used with w = n) implies that either n ∈ {x} ∪ X or
   n.id < x.id. In either case, {x} ∪ X contains a configuration y ∈ s.TotEst (either n or x) such
   that n.id ≤ y.id < c.id. Then Invariant 6.4.5 implies that either y ∈ s.use_p or
   y.id < s.act.id_p. By Invariant 6.4.3, Part 1, s.act_p ∈ s.TotEst and by definition,
   s.act_p ∈ s.use_p. So in either case, the hypothesis that m is the totally established
   configuration with the largest identifier belonging to s.use_p is contradicted.

4. c ∈ s.Att and w ∉ s.TotEst.

   Then π must be NEWSTATE(·)_p, for some p. Let m be the configuration in s.TotEst with the
   largest identifier that is strictly less than w.id. By the statement for s, we know that there
   exist R' ∈ m.rqrms and W' ∈ m.wqrms such that R' ∪ W' ⊆ w.set and there exist R'' ∈ m.rqrms
   and W'' ∈ m.wqrms such that R'' ∪ W'' ⊆ c.set. Hence, by the properties of quorums, there
   is a process q ∈ w.set ∩ c.set.

   Since c ∈ s.Att, there exists a process r such that c ∈ s.attempted_r. Thus also
   c ∈ s'.attempted_r. Since w ∈ s'.TotEst, we have that w ∈ s'.attempted_q. By assumption, there
   is no configuration x ∈ s'.TotEst such that w.id < x.id < c.id. By Invariant 6.4.8 applied to
   state s' (with p = r), we have that there exist R ∈ w.rqrms and W ∈ w.wqrms such that
   R ∪ W ⊆ c.set, as needed.

□


So far, the proof that DC-IMPL implements DC has been very similar to the proof that DVS-IMPL
implements DVS. The following invariant is different from the corresponding one in the DVS
implementation, Invariant 5.2.17. Here is where dead configurations are needed.


Invariant 6.4.10 (DC-IMPL)
In any reachable state, if c, w ∈ Att, w.id < c.id, configuration w is not dead and there is no
x ∈ TotEst with w.id < x.id < c.id, then there exist R ∈ w.rqrms and W ∈ w.wqrms such that
R ∪ W ⊆ c.set.


Proof: By induction on the length of an execution. The base case consists of proving that the
invariant is true in the initial state. In the initial state, only c_0 is attempted, so the
hypotheses cannot be satisfied. Thus, the statement is vacuously true.

For the inductive step assume the invariant is true in state s. We need to prove that it is true in
s' for any possible step (s, π, s'). Fix c and w, and assume that c ∈ s'.Att, w ∈ s'.Att,
w.id < c.id, w ∉ s'.dead, and there is no x ∈ s'.TotEst such that w.id < x.id < c.id. We consider
four cases:

1. w ∈ s.Att and c ∈ s.Att. Then the invariant is true by the inductive hypothesis.

2. w ∉ s.Att and c ∉ s.Att. This is not possible because a single action cannot make both w and
   c attempted.

3. w ∉ s.Att and c ∈ s.Att. Then it must be that π = NEWCONF(w)_p' for some process p' that
   attempts w. Since c ∈ s.Att we have that c ∈ s'.Att. Hence there exists p such that
   c ∈ s'.attempted_p.

   Now we claim that there must exist a process q ∈ c.set ∩ w.set.

   Clearly we have that w ∉ TotEst. Let Y = {y | y ∈ TotEst, y.id < w.id}. We first show that
   Y is nonempty: The initial configuration is totally established, thus c_0 ∈ TotEst; moreover
   c_0.id ≤ w.id. If c_0.id = w.id, then we have w = c_0. But then w ∈ TotEst, a contradiction to
   the definition of this case. So we must have c_0.id < w.id, which implies that c_0 ∈ Y, so Y is
   nonempty.

   Now fix z to be the configuration in Y with the largest id. We have that there is no x ∈ TotEst
   with z.id < x.id < c.id. Then Invariant 6.4.9 implies that there exist R ∈ z.rqrms and
   W ∈ z.wqrms such that R ∪ W ⊆ w.set and also that there exist R' ∈ z.rqrms and W' ∈ z.wqrms
   such that R' ∪ W' ⊆ c.set. By the properties of quorums we have that
   (R ∪ W) ∩ (R' ∪ W') ≠ ∅. Hence we have that there exists q such that q ∈ c.set ∩ w.set.

   Now we claim that w ∈ s'.attempted_q. By contradiction assume that w ∉ s'.attempted_q.
   Since c ∈ s'.attempted_p and q ∈ c.set we have that s'.info-rcvd[q, c.id]_p = (x, X) for some x
   and X. By Invariant 6.4.1 we have that s'.cur.id_q ≥ c.id > w.id. Since we assumed that
   w ∉ s'.attempted_q, by definition of dead configuration we have that w is dead. But this
   contradicts the hypothesis of the invariant, which states that w is not dead. Hence it must be
   that w ∈ s'.attempted_q.

   Hence we have that c ∈ s'.attempted_p, q ∈ c.set, w ∈ s'.attempted_q, w.id < c.id and there is
   no x ∈ s'.TotEst such that w.id < x.id < c.id. By Invariant 6.4.8 we have that there exist
   R ∈ w.rqrms and W ∈ w.wqrms such that R ∪ W ⊆ c.set.

4. w ∈ s.Att and c ∉ s.Att. Then it must be that π = NEWCONF(c)_p for some process p that
   attempts c. We have that c ∈ s'.attempted_p.

   The rest of the proof is as in the previous case: it claims that there exists
   q ∈ c.set ∩ w.set, that w ∈ s'.attempted_q, and then uses Invariant 6.4.8 to get the needed
   conclusion.


6.4.3 Proof that DC-IMPL implements DC


We are now ready to prove that DC-IMPL implements DC. We first provide a function mapping states
of DC-IMPL to states of DC. Then we will prove that this function is an abstraction function.
The abstraction function is given in Figure 6-4.

Let s be a state of DC-IMPL. The state t = F_dc(s) of DC is the following.

• t.created = ∪_{p ∈ P} s.attempted_p
• for each p ∈ P, t.cur-cid[p] = s.client-cur.id_p
• for each g ∈ G, t.attempted[g] = {p | g = c.id, c ∈ s.attempted_p}
• for each g ∈ G, t.state-dlv[g] = {p | s.estb[g]_p = true}
• for each g ∈ G, t.got-state[g](p) = s.state-got[g](p)_p
• for each g ∈ G, t.condenser[g](p) = s.cond[g]_p
• for each g ∈ G and i ∈ OID_p, t.pending[g](i) = ⊥ if s.pend[g](i)_p = ⊥, otherwise it is
  defined by
      msg = s.pend[g](i).msg_p
      cnd = s.pend[g](i).cnd_p
      sel = s.pend[g](i).sel_p
      dlv = {q | s.msg-dlvd[g](i)_q = true}
      ack = {q | (a, i) ∈ s.to-cs_q or (a, i) ∈ s.CS.pending[q, g] or s.pend[g](i).ack(q)_p = a}
      rsp = s.pend[g](i).rsp_p


Figure 6-4: The abstraction function F_dc.
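
The first three components of the abstraction function can be illustrated by the Python sketch
below. It covers only created, cur-cid and attempted, under simplifying assumptions: configurations
are represented by their identifiers, the function name abstract is hypothetical, and None models
the undefined value ⊥.

```python
def abstract(attempted, client_cur):
    """Map per-process implementation state to three components of the DC state.

    attempted maps each process p to the set of configuration identifiers it
    has attempted; client_cur maps p to its client-cur identifier (or None).
    """
    # t.created is the union of all per-process attempted sets.
    created = set().union(*attempted.values())
    # t.cur-cid[p] is the identifier of p's client-cur.
    cur_cid = dict(client_cur)
    # t.attempted[g] is the set of processes that attempted configuration g.
    attempted_by_id = {}
    for p, cids in attempted.items():
        for g in cids:
            attempted_by_id.setdefault(g, set()).add(p)
    return created, cur_cid, attempted_by_id


s_attempted = {1: {0, 1}, 2: {0}, 3: set()}
s_client_cur = {1: 1, 2: 0, 3: None}
created, cur_cid, attempted_by_id = abstract(s_attempted, s_client_cur)
```

In this example configuration 0 is attempted by processes 1 and 2 and configuration 1 only by
process 1, so a NEWCONF step at process 2 for configuration 1 would leave created unchanged and add
process 2 to the attempted set of configuration 1.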


In order to prove that F_dc is an abstraction function we need to prove that for any initial state s
of DC-IMPL we have that F_dc(s) is an initial state of DC and that for any possible step π of
DC-IMPL there exists a sequence α of steps of DC such that the trace of α, that is the externally
observable behavior, is equal to the trace of π. Lemmas 6.4.11 and 6.4.12 prove the above. The
proofs of these lemmas are similar to the corresponding ones for DVS. The key difference is in the
supporting invariant used in the proof, Invariant 6.4.10, which is crucial in proving the simulation
relation for the case when the implementation executes action π = NEWCONF(c)_p, for some
configuration c and some process p.

Lemma 6.4.11 If s is an initial state of DC-IMPL then F_dc(s) is an initial state of DC.

Proof: Let s be the initial state of DC-IMPL. We have that s.attempted_p is {c_0} for p ∈ P_0 and ∅
for p ∉ P_0. Hence, by definition of F_dc we have that t.created = {c_0}, which is as in the initial
state of DC.

We have that s.client-cur.id_p is g_0 for p ∈ P_0 and ⊥ for p ∉ P_0. Hence, by definition of F_dc we
have that t.cur-cid[p] is g_0 for p ∈ P_0 and ⊥ for p ∉ P_0. This is as in the initial state of DC.

We have that s.estb[g]_p is true for p ∈ P_0, g = g_0 and false otherwise. Hence we have that
t.state-dlv[g] is P_0 if g = g_0 and ∅ if g ≠ g_0. This is as in the initial state of DC.

We have that s.cond[g]_p = ⊥ for all g and p. Hence t.condenser[g](p) = ⊥ for all g and p, which is
as in the initial state of DC.

We have that s.pend[g](i) = ⊥ for all g and i. Hence t.pending[g](i) = ⊥ everywhere, which is as in
the initial state of DC.

Hence if s is an initial state of DC-IMPL, we have that F_dc(s) is an initial state of DC. □


Lemma 6.4.12 Let s be a reachable state of DC-IMPL, F_dc(s) a reachable state of DC, and (s, π, s')
a step of DC-IMPL. Then there is an execution fragment α of DC that goes from F_dc(s) to F_dc(s'),
such that trace(α) = trace(π).


Proof: By case analysis based on the type of the action π. (The only interesting case is where π =
NEWCONF(c)_p.) Define t = F_dc(s) and t' = F_dc(s').
1. π = CS-CREATECONF(c)
   Then trace((s, π, s')) = λ. Action π modifies created. The definition of F_dc is not sensitive
   to this change. Therefore, t = t', and we set α = t.

2. π = CS-NEWCONF(c)_p
   Then trace((s, π, s')) = λ. Action π modifies current-confid[p], cur_p, and
   info-sent[cur.id]_p, and adds an “info” message to to-cs[cur.id]_p. The definition of F_dc is
   not sensitive to any of these changes. Therefore, t = t', and we set α = t.

3. π = CS-GPSND(m)_p
   Then trace((s, π, s')) = λ. Action π just moves a message from the queue to-cs[cur.id]_p to the
   queue CS.pending[p, current-confid[p]]. The definition of F_dc is not sensitive to this change.
   Therefore, t = t', and we set α = t.

4. π = CS-ORDER(m, p, g)
   Then trace((s, π, s')) = λ. Action π moves a message from CS.pending[p, g] to CS.queue[g].
   The definition of F_dc is not sensitive to this change. Therefore, t = t', and we set α = t.

5. π = CS-GPRCV((“info”, c, C))_{q,p}
   Then trace((s, π, s')) = λ. Action π increments the next pointer in CS, and may modify act
   and amb in CS-TO-DC. The definition of F_dc is not sensitive to this change. Therefore, t = t',
   and we set α = t.

6. π = CS-SAFE((“info”, c, C))_{q,p}
   Then trace((s, π, s')) = λ. Action π increments the next-safe pointer in CS. The definition of
   F_dc is not sensitive to this change. Therefore, t = t', and we set α = t.


7. π = NEWCONF(c)_p

   Then trace((s, π, s')) = π. In DC-IMPL, this action modifies only variables amb_p, attempted_p,
   and client-cur_p. We have s'.client-cur_p = c and s'.attempted_p = s.attempted_p ∪ {c}. By
   definition of F_dc, we have that t'.cur-cid[p] = s'.client-cur.id_p,
   t'.created = ∪_{p ∈ P} s'.attempted_p and t'.attempted[c.id] = {p | c ∈ s'.attempted_p}. Hence
   we have that t'.cur-cid[p] = c.id, t'.created = t.created ∪ {c} and
   t'.attempted[c.id] = t.attempted[c.id] ∪ {p}, while all other state variables in t' are as in t.

   We consider two cases:

   (a) c ∈ t.created.
       In this case, we set α = (t, π', t'), where π' = NEWCONF(c)_p. The code shows that π' brings
       DC from state t to state t'. It remains to prove that π' is enabled in state t, that is,
       that c ∈ t.created and c.id > t.cur-cid[p]. The first of these two conditions is true
       because of the defining condition for this case. The second condition follows from the
       precondition of π in DC-IMPL: this precondition implies that c.id > s.client-cur.id_p, and
       by the definition of F_dc we have t.cur-cid[p] = s.client-cur.id_p.


(b) c ¢ t.created. In this case we set a = (t,7',t", 2", t'), where 7! =crEaATECONF(c)p, 7 =NEWCONF(c)p, 
and t” is the unique state that arises by running the effect of x’ from t. The code shows 
that a brings Dc from state ¢ to state t’. It remains to prove that 7‘ is enabled in ¢ and 


that a” is enabled in t”. 


e Action xz’. We start by proving that 7’ is enabled in t. 

The precondition of x’ requires that (i) Vw € t.created, c.id # w.id and (ii) if c is not 

dead, the following two conditions, Cl and C2, are true. 
C1: Vw € t.created, w.id < c.id, either w is dead or dx € t.Jot€st satisfying w.id < 
z.id < c.id or there exist R € w.rgrms and W € w.wqrms such that RU W C c.set; 
C2: Vw € t.created, c.id < w.id, either w is dead or dz € t.Jot€st satisfying cid < 
z.id < w.id or there exist R € c.rgrms and W € c.wgrms such that RUW C w.set. 
— requirement (i). To see requirement (i), suppose for the sake of contradiction 
that there exists w € t.created such that w.id = c.id. The precondition of 7 
in DC-IMPL implies that c = s.cur,, which implies that c € s.CS.created. Since 
w € t.created, the definition of Fg, implies that w € s.attempted, for some q. This 
implies that w € s.cS.created. Hence both c and w are created, that is, belong to 
t.created and since w.id = c.id we have that c = w. But this is impossible since 


c ¢ t.created and w € t.created. 
Requirement (ii). If c is dead, then requirement (ii) is trivially satisfied. Hence assume that c is not dead. We have to show that both C1 and C2 are true.

Let us start with C1. Assume that there exists w ∈ t.created such that w.id < c.id, that w is not dead and that there is no x ∈ t.TotEst satisfying w.id < x.id < c.id (otherwise C1 is true and we are done). Since w ∈ t.created, by definition of F_dc, w ∈ s.attempted_q for some q. Clearly, w ∈ s'.attempted_q. Therefore, w ∈ s'.Att. By the code of π we have that c ∈ s'.attempted_p. Therefore we also have c ∈ s'.Att. Moreover, there is no x ∈ s'.TotEst satisfying w.id < x.id < c.id (this is true in t and thus, by definition of F_dc, is true in s and, because π does not establish any configuration, it stays true in s'). By Invariant 6.4.10 we have that there exist R ∈ w.rqrms and W ∈ w.wqrms such that R ∪ W ⊆ c.set, as needed to prove C1.

We look now at C2. Assume that there exists w ∈ t.created such that c.id < w.id, and that there is no x ∈ t.TotEst satisfying c.id < x.id < w.id. We already know that c is not dead. Since w ∈ t.created, by definition of F_dc, w ∈ s.attempted_q for some q. Clearly, w ∈ s'.attempted_q. Therefore, w ∈ s'.Att. By the code of π we have that c ∈ s'.attempted_p. Therefore we also have c ∈ s'.Att. Moreover, there is no x ∈ s'.TotEst satisfying c.id < x.id < w.id. By Invariant 6.4.10 we have that there exist R ∈ c.rqrms and W ∈ c.wqrms such that R ∪ W ⊆ w.set, as needed to prove C2.

This proves that π' is enabled in t.


• Action π''. Now we prove that π'' is enabled in state t'.
The precondition of π'' requires that c ∈ t'.created and c.id > t'.cur-cid[p]. The first condition is true because c is added to created by π'. The second condition follows from the precondition of π in DC-IMPL: the precondition of π implies that c.id > s.client-cur.id_p. The definition of F_dc implies that t.cur-cid[p] = s.client-cur.id_p. Moreover, t'.cur-cid[p] = t.cur-cid[p]. It follows that c.id > t'.cur-cid[p]. Thus π'' is enabled in state t'.


8. π = SUBMIT-STATE(σ, φ)_p

Then trace((s, π, s')) = π. This action sets state-got[g](p)_p := σ and cond[g]_p := φ, and appends ("state-got", σ) to to-cs[g]_p, where g = client-cur_p. By the definition of F_dc we have that t'.got-state[g](p) = σ and t'.condenser[g](p) = φ, while all other state variables are as in t. We set α = (t, π, t'). The code shows that π actually brings DC from t to t'. Moreover π is an input action, so it is always enabled.

9. π = CS-GPRCV(("state-got", σ))_{q,p}

Then trace((s, π, s')) = λ. Action π sets state-got[g](q)_p := σ. The definition of F_dc is not sensitive to this change. Therefore, t = t', and we set α = t.

10. π = CS-SAFE(("state-got", σ))_{q,p}

Then trace((s, π, s')) = λ. Action π increments the next-safe pointer in CS. The definition of F_dc is not sensitive to this change. Therefore, t = t', and we set α = t.

11. π = NEWSTATE(σ)_p

Then trace((s, π, s')) = π. This action sets estb[g] := true and appends the message "established" to to-cs[g], where g = cur.id_p. By definition of F_dc we have that t'.state-dlv[g] = t.state-dlv[g] ∪ {p} and this is the only difference between t' and t. We set α = (t, π, t'). The code shows that π actually brings DC from t to t'. It remains to prove that π is enabled in t. The preconditions of π in DC-IMPL are the same as those in DC. Thus π is enabled in DC because it is enabled in DC-IMPL.

12. π = CS-GPRCV("established")_{q,p}

Then trace((s, π, s')) = λ. Action π sets rcvd-estb[q, cur.id_p]_p := true. The definition of F_dc is not sensitive to this change. Therefore, t = t', and we set α = t.

13. π = CS-SAFE("established")_p

Then trace((s, π, s')) = λ. Action π increments the next-safe pointer in CS. The definition of F_dc is not sensitive to this change. Therefore, t = t', and we set α = t.

14. π = GARBAGE-COLLECT(c)_p

Then trace((s, π, s')) = λ. This action can modify act_p and amb_p. The definition of F_dc is not sensitive to these changes. Therefore, t = t', and we set α = t.

15. π = SUBMIT(m, φ, b, i)_p

Then trace((s, π, s')) = π. This action sets pend[g] := (m, φ, b, ∅, ⊥, false) and appends (m, i) to to-cs[g], where g = client-cur.id_p ≠ ⊥. The definition of F_dc shows that this changes the queue of pending operations pending so that t'.pending[g](i) = (m, φ, b, ∅, ⊥, false) and this is the only difference between t and t'. We set α = (t, π, t'). The code shows that π actually brings DC from t to t'. Moreover π is an input action, so it is always enabled.

16. π = CS-GPRCV(m, i)_p

Then trace((s, π, s')) = λ. Action π appends (m, i) to dlv-queue[cur.id_p]_p. The definition of F_dc is not sensitive to this change. Therefore, t = t', and we set α = t.

17. π = CS-SAFE(m, i)_p

Then trace((s, π, s')) = λ. Action π increments the next-safe pointer in CS. The definition of F_dc is not sensitive to this change. Therefore, t = t', and we set α = t.


18. π = DELIVER(m, i)_p

Then trace((s, π, s')) = π. This action deletes the head of dlv-queue[g]_p and sets msg-dlvd[g](i) := true, where g = cur.id_p. By the definition of F_dc we have that the only thing that changes from t to t' is pending[g](i).dlv, that is, t'.pending[g](i).dlv = t.pending[g](i).dlv ∪ {p}. We set α = (t, π, t'). The code shows that π actually brings DC from t to t'. It remains to prove that π is enabled in t. From the precondition of π we have that s.msg-dlvd[g](i)_p = false and thus we have that p ∉ t.pending[g](i).dlv. From the precondition of π we have that (m, i) is the head of dlv-queue[g]_p. By Invariant 6.4.7 we have that t.pending[g](i).msg = m. Hence π is enabled in state t of DC.

19. π = ACK-DLVR(a, i)_p

Then trace((s, π, s')) = π. This action appends (a, i) to to-cs[g]_p, where g = client-cur.id_p. By the definition of F_dc we have that the only thing that changes from t to t' is pending[g](i).ack, that is, t'.pending[g](i).ack = t.pending[g](i).ack ∪ {p}. We set α = (t, π, t'). The code shows that π actually brings DC from t to t'. Moreover π is enabled in t because it is an input action.

20. π = CS-GPRCV(a, i)_p

Then trace((s, π, s')) = λ. Action π appends (m, i) to dlv-queue[cur.id_p]_p. The definition of F_dc is not sensitive to this change. Therefore, t = t', and we set α = t.

21. π = CS-SAFE(a, i)_p

Then trace((s, π, s')) = λ. Action π increments the next-safe pointer in CS. The definition of F_dc is not sensitive to this change. Therefore, t = t', and we set α = t.

22. π = RESPOND(a, i)_p

Then trace((s, π, s')) = π. This action sets pend[g](i).rsp := true. By the definition of F_dc we have that t'.pending[g](i).rsp = true. We set α = (t, π, t'). The code shows that π actually brings DC from t to t'. It remains to prove that π is enabled in t. The precondition of π in DC-IMPL is the same as the precondition of π in DC. Hence π is enabled in DC because it is enabled in DC-IMPL.

□

Lemmas 6.4.11 and 6.4.12 prove that F_dc is an abstraction function from DC-IMPL to DC and thus the following theorem holds.

Theorem 6.4.13 Every trace of DC-IMPL is a trace of DC.
6.5 An application of DC

In this section we show how to use DC to implement an atomic multi-writer multi-reader shared register. The algorithm is an extension of the single-writer multi-reader atomic register of Attiya, Bar-Noy and Dolev [12]. A similar extension was provided in [66].


6.5.1 Overview 


In this section we provide a description of the algorithm and the code. We start with the description 
of the algorithm. 

Each process keeps a copy of the shared register in variable val, paired with a tag in variable tag. Tags are used to establish the time when values are written: a value paired with a bigger tag has been written after a value paired with a smaller tag. Tags consist of pairs (j, p) where j is a sequence number (a non-negative integer) and p is a process identifier. Tags are ordered according to their sequence numbers, with process identifiers breaking ties. Given a tag t = (j, p), the notation t.seq denotes the sequence number j.
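The tag order described above is just lexicographic order on pairs. A minimal Python sketch follows; the names `Tag` and `next_tag` are illustrative and do not appear in the thesis, and process identifiers are assumed to be integers.

```python
from typing import NamedTuple


class Tag(NamedTuple):
    """A tag (j, p): tuples compare lexicographically, so seq is compared
    first and pid only breaks ties."""
    seq: int  # sequence number j, a non-negative integer
    pid: int  # process identifier p (assumed here to be an integer)


def next_tag(t: Tag, writer: int) -> Tag:
    """Tag a writer would attach to a new value after observing max tag t."""
    return Tag(t.seq + 1, writer)


older = Tag(3, 7)
newer = Tag(3, 9)            # same sequence number, larger process id
fresh = next_tag(newer, 2)   # larger sequence number wins regardless of pid
```

Because `NamedTuple` instances are ordinary tuples, `max` over a collection of tags returns the most recent one without any custom comparison code.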

The algorithm has two modes of operation: a normal mode and a reconfiguration mode. The latter is used to establish a new configuration. It is entered when a new configuration is announced (action NEWCONF) and is left when the configuration becomes established (action NEWSTATE). The former is the mode where read and write operations are performed; it is entered when a configuration is established and is left when a new configuration is announced. During the reconfiguration mode pending operations are delayed until the normal mode is restored. Variable conf-status is used to keep track of the mode (values exch-ready and exch-wait are for the reconfiguration mode).

Clients of the service can request read and write operations by means of actions READ_p and WRITE(x)_p. We assume that each client does not invoke a new operation request before receiving the response for the previous request. Both types of requests (read and write) are handled in a similar way: there is a query phase and a subsequent propagate phase. During the query phase the server receiving the request "queries" a read-quorum in order to get the value of the shared register and the corresponding tag from each member of the read-quorum. From these it selects the value x corresponding to the maximum tag t. This concludes the query phase. In the propagate phase the server sends a new value and a new tag (which are (x, t) for a READ_p operation and (y, (t.seq + 1, p)) for a WRITE(y)_p operation) to the members of a write-quorum. These processes update their own copy of the register if the tag received is greater than their current tag; then they send back an acknowledgment to the server p. When p gets the acknowledgment messages from the members of a write-quorum, the propagate phase is completed. At this point the server can respond to the client that issued the operation with either the value read, in the case of a read operation, or with just a confirmation, in the case of a write operation.
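The two phases can be illustrated with a small sequential simulation. This is only a sketch under strong assumptions: messages are replaced by direct method calls, quorums are passed in explicitly, there is no reconfiguration, and the names `Replica` and `operation` are hypothetical rather than taken from ABD-CODE.

```python
from typing import Dict, Iterable, Optional, Tuple

Tag = Tuple[int, int]  # (sequence number, process id), ordered lexicographically


class Replica:
    """One process's copy of the register (variables val and tag)."""

    def __init__(self, pid: int, default: str) -> None:
        self.val = default
        self.tag: Tag = (0, pid)

    def on_query(self) -> Tuple[str, Tag]:
        # Reply to a query with the local value and tag.
        return self.val, self.tag

    def on_prop(self, val: str, tag: Tag) -> Tag:
        # Adopt the propagated pair only if its tag is strictly greater.
        if tag > self.tag:
            self.val, self.tag = val, tag
        return tag  # plays the role of the acknowledgment


def operation(server: int, replicas: Dict[int, Replica],
              read_quorum: Iterable[int], write_quorum: Iterable[int],
              write_value: Optional[str] = None) -> Optional[str]:
    """Run a read (write_value is None) or a write as query + propagate."""
    # Query phase: collect (value, tag) from a read quorum, keep the max tag.
    replies = [replicas[q].on_query() for q in read_quorum]
    x, t = max(replies, key=lambda reply: reply[1])
    if write_value is None:
        new_val, new_tag = x, t                       # read: write back (x, t)
    else:
        new_val, new_tag = write_value, (t[0] + 1, server)
    # Propagate phase: push the pair to a write quorum and gather the acks.
    acks = [replicas[q].on_prop(new_val, new_tag) for q in write_quorum]
    if any(ack != new_tag for ack in acks):
        raise RuntimeError("unexpected acknowledgment")
    return x if write_value is None else None


replicas = {p: Replica(p, "x0") for p in range(5)}
operation(0, replicas, {0, 1, 2}, {2, 3, 4}, write_value="a")
result = operation(1, replicas, {1, 2, 3}, {0, 1, 2})
```

In this run the read quorum {1, 2, 3} intersects the write quorum {2, 3, 4} used by the earlier write, which is why the read observes the written value; the read then propagates the pair it saw to its own write quorum before returning.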
We remark that when a configuration change happens during the execution of a requested operation, the completion of the operation is delayed until the normal mode is restored. However, if the query phase has already been completed it is not necessary to repeat it in the new configuration.

We denote by T = {(j, p) | j ∈ N, p ∈ P} the set of tags. This set is ordered according to the first component, with the process identifiers breaking ties. We denote by X the set of values that the shared register can assume. We assume that there is a default value x_0 ∈ X. Initially all members of c_0 have the shared copy of the register set to x_0.

All other data types used in the code are as defined for DC.


Figures 6-5 and 6-6 provide the code of ABD-CODE_p; in this code we use the following condenser functions:

• φ_maxtag, which computes the value and the tag of the register copy with the maximum tag. Formally this function takes a collection of tuples Z = {("query-ack", v, t, h)}, with t ∈ T, and returns one such quadruple which has a maximum tag t among the elements of Z, together with the set Q of processes that submitted the tuples; formally it returns a tuple ("query-ack", v, t, h, Q).

• φ_ack, which returns an acknowledgment. Formally this function takes a collection of pairs Z = {("prop", t)}, all of which have the same value t ∈ T, and returns ("prop-ack", t, Q), where Q is the set of processes that submitted the pairs.

• φ_state, which computes the up-to-date state for a new configuration. Formally it takes a collection of triples Z = {(x, t, h)} where x is a value, t a tag and h a configuration identifier, considers the subset Z' of the triples of Z with maximum h and returns the first two components of a triple of Z' which has the maximum t in Z'. Such a triple can be picked by default, choosing e.g., the one that came from the process with the smallest identifier.
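The three condenser functions can be sketched in Python as follows. The representation is an assumption made for illustration: each collection Z is modelled as a dictionary from the submitting process identifier to its tuple, tags are (seq, pid) pairs, and the function names `phi_maxtag`, `phi_ack` and `phi_state` are chosen here.

```python
from typing import Dict, FrozenSet, Tuple

Tag = Tuple[int, int]


def phi_maxtag(Z: Dict[int, tuple]) -> tuple:
    """Z maps pid -> ("query-ack", v, t, h); return a max-tag entry plus Q."""
    _, v, t, h = max(Z.values(), key=lambda msg: msg[2])
    return ("query-ack", v, t, h, frozenset(Z))


def phi_ack(Z: Dict[int, tuple]) -> tuple:
    """Z maps pid -> ("prop", t), all with the same t; return the ack plus Q."""
    tags = {msg[1] for msg in Z.values()}
    if len(tags) != 1:
        raise ValueError("all submitted pairs must carry the same tag")
    return ("prop-ack", tags.pop(), frozenset(Z))


def phi_state(Z: Dict[int, tuple]) -> tuple:
    """Z maps pid -> (x, t, h); return (x, t) of a max-tag triple among
    the triples with maximum h, breaking ties by smallest pid."""
    max_h = max(h for (_, _, h) in Z.values())
    z_prime = {p: s for p, s in Z.items() if s[2] == max_h}
    max_t = max(t for (_, t, _) in z_prime.values())
    chosen = min(p for p, s in z_prime.items() if s[1] == max_t)
    x, t, _ = z_prime[chosen]
    return (x, t)


queries = {1: ("query-ack", "a", (1, 1), 0), 2: ("query-ack", "b", (2, 1), 0)}
states = {1: ("a", (5, 1), 0), 2: ("b", (2, 2), 1), 3: ("c", (3, 1), 1)}
```

Note that `phi_state` ignores the triple with the largest tag, (5, 1), because it comes from a process whose high component is smaller: the maximum is taken over h first and only then over t.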


ABD-CODE_p (signature and state)

Signature:

Input:
    READ_p, p ∈ P
    WRITE(x)_p, x ∈ X, p ∈ P
    DELIVER(m, i)_p, m ∈ M, i ∈ OID, p ∈ P
    RESPOND(a, i)_p, a ∈ A, i ∈ OID_p, p ∈ P
    NEWCONF(c)_p, c ∈ C, p ∈ c.set
    NEWSTATE(s)_p, s ∈ S

Output:
    SUBMIT(m, φ, b, i)_p, m ∈ M, φ ∈ Φ, b ∈ {"r", "w"}, p ∈ P, i ∈ OID_p
    ACKDLVR(a, i)_p, a ∈ A, i ∈ OID, p ∈ P
    READ-CONFIRM(x)_p, x ∈ X, p ∈ P
    WRITE-CONFIRM_p, p ∈ P
    SUBMIT-STATE(s, φ)_p, s ∈ S, φ ∈ Φ

State:

    current ∈ C_⊥, init c_0 if p ∈ P_0 else ⊥
    high ∈ G_⊥, init g_0 if p ∈ P_0 else ⊥
    val ∈ X, initially x_0
    tag ∈ T, initially (0, p)
    prop-val ∈ X, initially x_0
    prop-tag ∈ T, initially (0, p)
    conf-status ∈ {normal, exch-ready, exch-wait}, init normal
    status ∈ {query-ready, query-wait, prop-ready, prop-wait, prop-done}, init query-ready
    request ∈ {"read", ("write", x), ⊥}, init ⊥
    ack-q, seqof(A × OID), init λ

Figure 6-5: The ABD interface and state.


We define the system ABD-SYS as the composition of DC and ABD-CODE_p, for each p ∈ P.
ABD-CODE_p (transitions)

Actions:

input READ_p
    Eff: request := "read"

input WRITE(x)_p
    Eff: request := ("write", x)

output SUBMIT("query", φ_maxtag, "r", i)_p
    Pre: current ≠ ⊥
         i ∉ used-ids
         conf-status = normal
         status = query-ready
         request ≠ ⊥
    Eff: used-ids := used-ids ∪ {i}
         status := query-wait

input RESPOND(("query-ack", x, t, h, Q), i)_p
    Eff: if request = "read" then
             prop-tag := t
             prop-val := x
         if request = ("write", y) then
             prop-tag := (t.seq + 1, p)
             prop-val := y
         status := prop-ready

output SUBMIT(("prop", x, t), φ_ack, "w", i)_p
    Pre: current ≠ ⊥
         i ∈ used-ids
         conf-status = normal
         status = prop-ready
         x = prop-val
         t = prop-tag
    Eff: status := prop-wait

input RESPOND(("prop-ack", t, Q), i)_p
    Eff: status := prop-done

output READ-CONFIRM(x)_p
    Pre: conf-status = normal
         status = prop-done
         request = "read"
         x = prop-val
    Eff: request := ⊥
         status := query-ready

output WRITE-CONFIRM_p, choose x
    Pre: conf-status = normal
         status = prop-done
         request = ("write", x)
    Eff: request := ⊥
         status := query-ready

input DELIVER("query", i)_p
    Eff: append (("query-ack", val, tag, high), i) to ack-q

input DELIVER(("prop", x, t), i)_p
    Eff: if t > tag then
             val := x
             tag := t
         append (("prop-ack", t), i) to ack-q

output ACKDLVR(a, i)_p
    Pre: head(ack-q) = (a, i)
    Eff: ack-q := tail(ack-q)

input NEWCONF(c)_p
    Eff: current := c
         conf-status := exch-ready
         if status = query-wait then
             status := query-ready
         if status = prop-wait then
             status := prop-ready
         ack-q := λ

output SUBMIT-STATE((x, t, h), φ_state)_p
    Pre: conf-status = exch-ready
         x = val
         t = tag
         h = high
    Eff: conf-status := exch-wait

input NEWSTATE(x, t)_p
    Eff: conf-status := normal
         if t > tag then
             val := x
             tag := t
         high := current

Figure 6-6: The ABD-CODE transitions.
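The interplay between conf-status and status can be hard to read off the transition list, so here is a minimal Python sketch of only that part of the automaton. It is an illustration, not the thesis code: the class `ServerStatus` and its method names are hypothetical, and values, tags, identifiers and the DC service itself are omitted.

```python
class ServerStatus:
    """Tracks only conf-status, status and ack-q of one ABD-CODE process."""

    def __init__(self) -> None:
        self.conf_status = "normal"
        self.status = "query-ready"
        self.ack_q = []

    def submit_query(self) -> None:
        # Output action: enabled only in normal mode with status query-ready.
        if self.conf_status != "normal" or self.status != "query-ready":
            raise RuntimeError("SUBMIT(query) is not enabled")
        self.status = "query-wait"

    def new_conf(self) -> None:
        # Input action: enter reconfiguration mode and roll any phase that
        # was waiting for a response back to its "ready" state.
        self.conf_status = "exch-ready"
        if self.status == "query-wait":
            self.status = "query-ready"
        if self.status == "prop-wait":
            self.status = "prop-ready"
        self.ack_q = []

    def submit_state(self) -> None:
        # Output action: enabled only right after a configuration change.
        if self.conf_status != "exch-ready":
            raise RuntimeError("SUBMIT-STATE is not enabled")
        self.conf_status = "exch-wait"

    def new_state(self) -> None:
        # Input action: the configuration is established, back to normal mode.
        self.conf_status = "normal"


server = ServerStatus()
server.submit_query()   # query phase in progress
server.new_conf()       # interrupted: the query must be resubmitted later
```

After `new_conf` the interrupted query is not lost: the status is query-ready again, but `submit_query` stays disabled until `submit_state` and `new_state` have restored the normal mode.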
6.5.2 Proof that ABD-SYS is an atomic register

In this section we prove that ABD-SYS implements an atomic read/write shared register. This proof uses an approach similar to that used in Chapter 5 and in [41] to prove the correctness of applications built on top of DVS and VS, respectively.

We need the following history variables.

For a process p and a configuration identifier g, variable buildtag[p, g] ∈ T_⊥ is defined as follows:

• If current.id_p = g then buildtag[p, g] = tag_p;

• If current.id_p > g then buildtag[p, g] is the value of tag_p at the moment when process p leaves configuration g;

• If current.id_p < g then buildtag[p, g] = ⊥.

Informally, buildtag[p, g] records the value of the latest tag, if any, used in configuration g by process p.
The value of buildtag[p, g] can be easily computed by following the statement that modifies tag_p in actions DELIVER and NEWSTATE with another statement buildtag[p, current.id_p] := tag_p. It should be clear that (i) when p is in configuration g we have that buildtag[p, g] = tag_p and (ii) after p leaves configuration g, buildtag[p, g] forever afterwards contains the value of tag_p at the point when p left configuration g.
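As a rough illustration of how such a history variable can be maintained, the following Python sketch tracks buildtag for a single process. The class `TagHistory` is hypothetical; `None` stands in for ⊥, and, as an assumption of this sketch, the entry for a newly entered configuration is initialised when the configuration changes so that property (i) holds immediately.

```python
from typing import Dict, Optional, Tuple

Tag = Tuple[int, int]


class TagHistory:
    """Maintains buildtag[p, g] for one fixed process p."""

    def __init__(self, conf_id: int, tag: Tag) -> None:
        self.current_id = conf_id
        self.tag = tag
        self.buildtag: Dict[int, Tag] = {conf_id: tag}

    def set_tag(self, tag: Tag) -> None:
        # Mirrors "tag := t" followed by buildtag[p, current.id] := tag.
        self.tag = tag
        self.buildtag[self.current_id] = tag

    def new_conf(self, conf_id: int) -> None:
        # The entry of the configuration being left is frozen from now on.
        self.current_id = conf_id
        self.buildtag[conf_id] = self.tag

    def lookup(self, g: int) -> Optional[Tag]:
        # None for configurations p has not entered (current.id < g).
        return self.buildtag.get(g)


history = TagHistory(0, (0, 1))
history.set_tag((3, 2))   # tag used while in configuration 0
history.new_conf(1)
history.set_tag((4, 1))   # tag used while in configuration 1
```

After these steps the entry for configuration 0 stays at (3, 2), the tag held when the process left it, while the entry for configuration 1 follows the current tag.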

We also need history variables to record the beginning and the end of the query and propagate phases of each operation, the configurations where the phases are executed, the quorums of processes involved, as well as the tags returned by the query phases and the ones written by the propagate phases.

We define the following history variables:

• query-begin[i] ∈ {true, false}, initially false for all i ∈ OID. This variable is set to true when action SUBMIT("query", φ_maxtag, "r", i)_p is executed by some process p. Informally, when query-begin[i] = true the query phase of operation i has started.

• query-end[i] ∈ {true, false}, initially false for all i ∈ OID. This variable is set to true when action RESPOND(("query-ack", x, t, h, Q), i)_p is executed by some process p and for any value of x, t, h and Q. Informally, when query-end[i] = true the query phase of operation i has been completed.

• query-quorum[i] ∈ 2^P_⊥, initially ⊥ for all i ∈ OID. This variable is set to Q when action RESPOND(("query-ack", x, t, h, Q), i)_p is executed by some process p and for any value of x, t and h. Informally, query-quorum[i] records the read quorum used by the query phase of operation i.

• query-conf[i] ∈ C_⊥, initially ⊥ for all i ∈ OID. This variable is set to c when action RESPOND(("query-ack", x, t, h, Q), i)_p is executed by some process p when current_p = c, and for any value of x, t, h and Q. Informally, variable query-conf[i] records the configuration in which the query phase of operation i is performed.

• query-tag[i] ∈ T_⊥, initially ⊥ for all i ∈ OID. This variable is set to t when action RESPOND(("query-ack", x, t, h, Q), i)_p is executed by some process p and for any value of x, h and Q. Informally, variable query-tag[i] records the tag returned by the query phase of operation i.

• prop-begin[i] ∈ {true, false}, initially false for all i ∈ OID. This variable is set to true when action SUBMIT(("prop", x, t), φ_ack, "w", i)_p is executed by some process p and for any value of x and t. Informally, when prop-begin[i] = true the propagate phase of operation i has started.

• prop-end[i] ∈ {true, false}, initially false for all i ∈ OID. This variable is set to true when action RESPOND(("prop-ack", t, Q), i)_p is executed by some process p and for any value of t and Q. Informally, when prop-end[i] = true the propagate phase of operation i has been completed.

• prop-quorum[i] ∈ 2^P_⊥, initially ⊥ for all i ∈ OID. This variable is set to Q when action RESPOND(("prop-ack", t, Q), i)_p is executed by some process p and for any value of t. Informally, prop-quorum[i] records the write quorum used by the propagate phase of operation i.

• prop-conf[i] ∈ C_⊥, initially ⊥ for all i ∈ OID. This variable is set to c when action RESPOND(("prop-ack", t, Q), i)_p is executed by some process p when current_p = c and for any value of t and Q. Informally, variable prop-conf[i] records the configuration in which the propagate phase of operation i is performed.

• prop-tag[i] ∈ T_⊥, initially ⊥ for all i ∈ OID. This variable is set to t when action RESPOND(("prop-ack", t, Q), i)_p is executed by some process p for any value of Q. Informally, variable prop-tag[i] records the tag written during the propagate phase of operation i; this is also the tag associated with operation i.


We define the set of summaries as the set Sum = {(x, t, g) | x ∈ X, t ∈ T, g ∈ G}. Given a summary y ∈ Sum, y = (x, t, g), selectors for the components are y.value = x, y.tag = t and y.high = g.

Informally a summary y = (x, t, g) is used to record that some process p has been in a state in which val_p = x, tag_p = t and high_p = g.

We write allstate[p, g] to denote a set of summaries defined so that (x, t, h) ∈ allstate[p, g] if one of the following holds.

1. current_p = g and val_p = x, tag_p = t and high_p = h.

2. got-state[g](p) = (x, t, h).

3. Message ("query-ack", x, t, h) is somewhere in the message mechanism of DC; more formally:

   • (("query-ack", x, t, h), i) ∈ ack-q_p, for some operation identifier i;

   • pending[g](i).ack(p) = ("query-ack", x, t, h), for some operation identifier i.

4. pending[g](i).msg_p = ("prop", x, t, h), for some operation identifier i.

We write allstate[g] to denote ∪_{p ∈ P} allstate[p, g], and allstate to denote ∪_{g ∈ G} allstate[g].
If Y is a partial function from process identifiers to summaries, then we define: maxprimary(Y) = max_{q ∈ dom(Y)} {Y(q).high}, reps(Y) = {q ∈ dom(Y) : Y(q).high = maxprimary(Y)}, and chosenrep(Y) denotes some element q' in reps(Y) that maximizes Y(q').tag.
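The three functions on a partial function Y can be written down directly. In this Python sketch Y is modelled as a dictionary from process identifiers to (value, tag, high) triples; the tie-break in `chosenrep` (smallest process identifier) is an assumption, since the definition above only asks for some maximizing element.

```python
from typing import Dict, Set, Tuple

Summary = Tuple[str, Tuple[int, int], int]  # (value, tag, high)


def maxprimary(Y: Dict[int, Summary]) -> int:
    """Largest high component among the summaries in Y."""
    return max(y[2] for y in Y.values())


def reps(Y: Dict[int, Summary]) -> Set[int]:
    """Processes whose summary carries the largest high component."""
    top = maxprimary(Y)
    return {q for q, y in Y.items() if y[2] == top}


def chosenrep(Y: Dict[int, Summary]) -> int:
    """A member of reps(Y) with maximum tag (smallest pid on ties)."""
    # max returns the first maximal element, so iterating in ascending
    # pid order yields the smallest pid among equally tagged candidates.
    return max(sorted(reps(Y)), key=lambda q: Y[q][1])


Y = {1: ("a", (5, 1), 0), 2: ("b", (2, 2), 1), 3: ("c", (3, 1), 1)}
```

For this Y, process 1 holds the largest tag overall but is excluded from reps(Y) because its high component is not maximal, so the chosen representative is process 3.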


Next we provide some preliminary invariants. Since these invariants state simple facts and some of them are very similar to the ones used in [41], we provide operational proofs instead of formal assertional proofs.

Invariant 6.5.1 (ABD-SYS)
In any reachable state, we have that current.id_p = cur-cid_p.

Proof: Both variables are initially g_0 for p ∈ P_0 and ⊥ for p ∉ P_0. Both variables are set to c.id when action NEWCONF(c)_p is executed. □

Invariant 6.5.2 (ABD-SYS)
In any reachable state, if query-end[i] = true or prop-end[i] = true, then we have that prop-conf[i] ∈ created \ dead.

Proof: By definition of prop-conf, query-end and prop-end we have that if query-end[i] = true or prop-end[i] = true, then prop-conf[i] ≠ ⊥. Let c = prop-conf[i]. Clearly c must be created.

The query phase and the propagate phase are executed in normal processing, that is, when conf-status = normal for each process involved. Thus processes participating in the query or in the propagate phase have executed action NEWSTATE for configuration c. Such an action is executed only when all members of c have submitted their state to the DC service. In order to submit their state for configuration c, each member must have executed action NEWCONF(c). Hence c is not dead. □


Invariant 6.5.3 (ABD-SYS)
In any reachable state the following are true.

1. high_p ∈ TotAtt.

2. If y ∈ allstate then y.high ∈ TotAtt.

Proof Sketch: For any summary y, configuration y.high is a configuration which has been high_p for some process p. Hence it suffices to prove Part 1. Variable high_p is set only by action NEWSTATE(·)_p. This action is executed only if all the members of the configuration have submitted their state to the DC service. This implies that all the members of the configuration have attempted high_p. □

Invariant 6.5.4 (ABD-SYS)
In any reachable state:

1. high_p ∉ dead.

2. If y ∈ allstate then y.high ∉ dead.

Proof Sketch: This invariant follows easily from the previous one since a totally attempted configuration cannot be dead. □


Invariant 6.5.5 (ABD-SYS)
In any reachable state, the following is true: If c ∈ created \ dead and ∃y ∈ allstate such that y.high > c.id, then there exists R ∈ c.rqrms such that for all p ∈ R, current.id_p > c.id.

Proof: Let configuration c and y ∈ allstate be such that c ∈ created \ dead and y.high > c.id. By Invariant 6.5.3 we have that y.high ∈ TotAtt. Then by Invariant 6.3.3, we have that there exists R ∈ c.rqrms such that for all p ∈ R, cur-cid_p > c.id. By Invariant 6.5.1 we have that current.id_p = cur-cid_p. It follows that there exists R ∈ c.rqrms such that for all p ∈ R, current.id_p > c.id. □

Invariant 6.5.6 (ABD-SYS)
In any reachable state, the following is true. Let c, w be two configurations such that c.id < w.id and t ∈ T. Let p ∈ c.set ∩ w.set and let y = got-state[w.id](p). If buildtag[p, c.id] > t, y ≠ ⊥ and y.high > c.id, then y.tag > t.

Proof Sketch: Since y = got-state[w.id](p) ≠ ⊥ we have that current.id_p > w.id > c.id. Since y.high > c.id we have that y.tag > buildtag[p, c.id]. By assumption buildtag[p, c.id] > t. Hence we have that y.tag > t. □


The following invariant is the analog of Invariant 6.13 of [41].

Invariant 6.5.7 (ABD-SYS)
In any reachable state, for any p, for any summary y and for all c, w ∈ created we have that: If state-dlv[p, c.id] ≠ ⊥, c.id < w.id and y ∈ allstate[p, w.id] then y.high > c.id.

Proof Sketch: Assume that p, y, c and w satisfy the hypothesis. Since state-dlv[p, c.id] ≠ ⊥, we have that process p has executed action NEWSTATE(·)_p for configuration c. When executing this action it sets high_p := c. Hence any summary due to p for a later configuration has a high component which is at least c.id. This is true also for y, which is a summary due to p for configuration w, since w.id > c.id. □

Invariant 6.5.8 (ABD-SYS)
In any reachable state, if got-state[g](p) ≠ ⊥ then current.id_p > g.

Proof Sketch: Since got-state[g](p) ≠ ⊥ we have that p submitted its state to DC. In order for process p to submit its state for configuration g it must be that current_p = c, where c.id = g. Afterwards, by monotonicity of configuration identifiers, we have that current.id_p > g. □


Next we provide a sequence of invariants which leads to the proof that ABD-SYS implements an atomic read/write shared register.

Invariant 6.5.9 (ABD-SYS)
In any reachable state, the following is true. Let c ∈ created \ dead, W ∈ c.wqrms and let t ∈ T. If for every r ∈ W such that current.id_r > c.id it holds that estb[c.id]_r = true and buildtag[r, c.id] > t, then we have that every summary y ∈ allstate with y.high > c.id satisfies y.tag > t.

Proof: By induction on the length of the execution. In the initial state, the only created configuration is c_0, and there are no summaries y with y.high > g_0. So the invariant is vacuously true and the base case is proved.

For the inductive step assume that the invariant is true in a state s. We need to prove that the invariant is true in s' for any step (s, π, s'). To prove that the invariant is true in s' we fix c ∈ s'.created \ s'.dead, W ∈ c.wqrms, and t ∈ T, and assume that for every r ∈ W, if s'.current.id_r > c.id then s'.estb[c.id]_r = true and s'.buildtag[r, c.id] > t. To prove the invariant we need to prove the following conclusion: for any summary y ∈ s'.allstate such that y.high > c.id, we have that y.tag > t.


Let us first consider the case when c ∉ s.created. Since c ∈ s'.created, action π must be CREATECONF(c). We consider two subcases.

1. ∃r ∈ c.set : s.current.id_r > c.id.

Fix such a process r. Since c has just been created, r has not attempted c in s, so c ∈ s.dead, which implies c ∈ s'.dead. This contradicts the assumption that c ∈ s'.created \ s'.dead. So this case is not possible.

2. ∄r ∈ c.set : s.current.id_r > c.id.

In this case, we claim that the invariant is true because there is no y in s'.allstate with y.high > c.id. By contradiction, fix a y ∈ s'.allstate such that y.high > c.id. By Invariant 6.5.4 configuration y.high is not dead. Then Invariant 6.5.5 applied to s' implies that there exists R ∈ c.rqrms such that for all q ∈ R, s'.current.id_q > c.id. Fix some q ∈ R. Since s.current.id_q = s'.current.id_q, it follows that s.current.id_q > c.id. But q ∈ c.set and thus we have a contradiction of the defining condition for this case.

Hence in the case when c ∉ s.created the invariant is true. For the rest of the proof we assume that c ∈ s.created. Since c ∉ s'.dead we have that c ∉ s.dead.

As usual, the interesting steps are those that convert the hypothesis from false to true, and those that keep the hypothesis true while converting the conclusion from true to false. There are no steps that convert the hypothesis from false to true. So it remains to consider any steps that keep the hypothesis true while converting the conclusion from true to false. Thus, we assume that, for every r ∈ W, if s.current.id_r > c.id then s.estb[c.id]_r = true and s.buildtag[r, c.id] > t. The only steps that can convert the conclusion from true to false are steps that produce a new summary (because if a summary y ∈ s.allstate has y.high > c.id, then by the inductive hypothesis we have that y.tag > t and we are done).

Any step that produces a summary y by modifying an old summary y' ∈ s.allstate, in such a way that y'.tag < y.tag and y'.high = y.high, is easy to handle: for such a step, y'.high > c.id and so the inductive hypothesis implies that t < y'.tag < y.tag, as needed. So the only concern is with a NEWSTATE action for some configuration w.

Hence we assume that π = NEWSTATE((x̂, t̂, ĥ))_p for some process p such that s.current_p = w. Action π produces the following new summary y = (s'.val_p, s'.tag_p, s'.high_p) and since, by the code of π, s'.high_p = w.id we have y.high = w.id. Assume that y.high > c.id (otherwise we are done). In order to prove the invariant we have to prove that y.tag > t.

Since y.high > c.id and y.high = w.id, we have that w.id > c.id. We also notice that configuration w is not dead in s. Indeed by the code of π we have that s'.high_p = s'.current_p and since s'.current_p = s.current_p = w we have that w = s'.high_p. By Invariant 6.5.4 we have that w ∉ s'.dead. Clearly w ∉ s.dead.

Let Y = s.got-state[w.id] and let p' = chosenrep(Y). Let y' be the summary y' = s.got-state[w.id](p').


Before proving that y.tag > t we prove two claims that are needed for the proof.

• CLAIM 1. y'.high > c.id.

Let c' denote the configuration which has the highest identifier in the set of configurations {c'' | c'' ∈ s'.TotEst, c''.id < w.id}. Remember that w ∉ dead and that c.id < w.id.

We consider two possible cases:

1. c'.id > c.id
Since c' ∈ s'.TotEst, we have that c' ∉ s'.dead (and thus c' ∉ s.dead). Also w ∉ s'.dead. By definition of c', we have that there are no totally established configurations in between c' and w. Then Invariant 6.3.1 shows that there exists R ∈ c'.rqrms such that R ⊆ w.set. Fix any q ∈ R. Since c' ∈ s'.TotEst we have that s.state-dlv[q, c'.id] ≠ ⊥. Let y'' = s.got-state[w.id](q). By Invariant 6.5.7, we obtain that y''.high > c'.id. By the definition of p' as a member that maximizes the high component in the summaries recorded in got-state, we have y'.high > y''.high. Therefore y'.high > c'.id > c.id, completing our proof of the claim for this case.

2. c'.id < c.id
By assumption, c ∉ dead. We have observed above that w ∉ dead. By definition of c', we have that there are no totally established configurations in between c' and w, and since c'.id < c.id < w.id it follows that there are no totally established configurations in between c and w. Then Invariant 6.3.1 shows that there exists R ∈ c.rqrms such that R ⊆ w.set. We have that R ∩ W ≠ ∅. Let q be any element of R ∩ W. Since R ⊆ w.set, we have that q ∈ w.set. Because s.got-state[w.id](q) ≠ ⊥, Invariant 6.5.8 implies that s.current.id_q > w.id. Since w.id > c.id we have that s.current.id_q > c.id.
Recall that we have assumed that for every r ∈ W, if s.current.id_r > c.id then s.estb[c.id]_r = true and s.buildtag[r, c.id] > t. Therefore, since q ∈ W and s.current.id_q > c.id, we have that s.estb[c.id]_q = true and thus s.state-dlv[q, c.id] ≠ ⊥.
Let y'' = s.got-state[w.id](q); thus y'' ∈ s.allstate[q, w.id]. By Invariant 6.5.7, we obtain that y''.high > c.id. By the definition of p' as a member that maximizes the high component in the summaries recorded in got-state[w.id], we have y'.high > y''.high. Therefore y'.high > c.id, completing our proof of the claim for this case.

Thus we have proved that y'.high > c.id.


• CLAIM 2. If y'.high = c.id, then in s' there is no totally established configuration w' such that c.id < w'.id < w.id.

To see this, consider again the totally established configuration c' with the largest identifier less than w.id. By the definition of c' it suffices to prove that c'.id < c.id.

Neither c' nor w is dead. Since there are no totally established configurations in between c' and w, Invariant 6.3.1 implies that w.set contains a read-quorum of c', and thus an element of c'.set. That is, there exists q ∈ c'.set ∩ w.set. Consider the summary y'' = s.got-state[w.id](q). By the precondition of π we have y'' ≠ ⊥ and thus we have that y'' ∈ s.allstate[q, w.id]. By definition of c' as the totally established configuration with the largest identifier less than w.id, we have that c'.id < w.id and that state-dlv[q, c'.id] ≠ ⊥. Then Invariant 6.5.7 shows that y''.high > c'.id. The definition of p' as a member that maximizes the high component among the summaries recorded in s.got-state[w.id] shows that y'.high > y''.high > c'.id. But the claim is conditional on the hypothesis that y'.high = c.id. So if y'.high = c.id we have that c.id > c'.id, which gives the claim.

Hence we have proved that if y'.high = c.id, in s' there is no totally established configuration w' such that c.id < w'.id < w.id.


We are now ready to prove that y.tag ≥ t. By Claim 1, we have that y'.high ≥ c.id.

If y'.high > c.id, by the inductive hypothesis we have that y'.tag ≥ t and since y.tag ≥ y'.tag,
we have that y.tag ≥ t, as needed.

So suppose y'.high = c.id. By Claim 2, in s' there is no totally established configuration w' such
that c.id < w'.id < w.id.

We know that c ∉ s.dead and that w ∉ s.dead. Thus by Invariant 6.3.1 we have that there
exists R ∈ c.rqrms such that R ⊆ w.set; therefore, since R ∩ W ≠ ∅, there exists q ∈ W ∩ w.set.
By the precondition of π, using the fact that q ∈ w.set, we have s.got-state[w.id](q) ≠ ⊥ and thus
s.current.id_q ≥ w.id > c.id. Thus also s'.current.id_q > c.id. We have that q ∈ W and
s'.current.id_q > c.id; for such a process we have that s'.buildtag[q, c.id] ≥ t and
s'.estb[c.id]_q = true. Since action π does not modify these variables we have that
s.buildtag[q, c.id] ≥ t and s.estb[c.id]_q = true. The precondition of π shows that
s.got-state[w.id](q) ≠ ⊥. Let summary y'' = s.got-state[w.id](q).
Thus y'' ∈ allstate[q, w.id]. Since s.estb[c.id]_q = true we have that state-dlv[q, c.id] ≠ ⊥. By
Invariant 6.5.7, we have that y''.high ≥ c.id. By Invariant 6.5.6 we have that y''.tag ≥ t. Recall
that y'.high = c.id and by definition y' is a summary with maximal high. Since y''.high ≥ c.id it
must be that y''.high = c.id, and so the summary y'' from q is among those with maximal high in
s.got-state[w.id]. By the definition of p' as a member that maximizes the tag component, we have
that y'.tag ≥ y''.tag, so y'.tag ≥ t. Since by the code y.tag = y'.tag, we have that y.tag ≥ t, as
needed. □


The next invariant states that when a completed propagate phase performed in a configuration c
has propagated a tag t, all summaries whose high component is greater than c.id have a tag component
which is greater than or equal to t.


Invariant 6.5.10 (ABD-SYS) 
In any reachable state, if prop-end[i] = true, prop-tag[i] = t, c = prop-conf[i] and y ∈ allstate is a
summary with y.high > c.id, then y.tag ≥ t.


Proof: Since prop-end[i] = true, by Invariant 6.5.2 we have that c ∈ created \ dead.
Let W = prop-quorum[i] (this is the write quorum used in the propagate phase of operation
i). Since prop-conf[i] = c, for all p ∈ W, we have that estb[c.id]_p = true (because process p is
involved in the propagate phase of operation i, so it must have established c). This implies also that
current.id_p ≥ c.id. Moreover since prop-tag[i] = t, if a processor p ∈ W has current.id_p > c.id,
by monotonicity of the tag, we have that buildtag[p, c.id] ≥ t (because process p is involved in the
propagate phase of operation i and hence knows tag t; when it leaves configuration c, buildtag[p, c.id]
must be at least t).

By Invariant 6.5.9 we have that y.tag ≥ t. □


The next lemma states a property of any execution of ABD-Sys. Namely, if a completed propagate 
phase propagates a tag t, any subsequent query phase that totally follows the propagate phase (that 
is, begins after the propagate phase has ended), gets a tag which is greater than or equal to t. 


Lemma 6.5.11 (ABD-SYS)
In any execution, if the completed propagate phase of an operation i totally precedes the completed
query phase of an operation j, then query-tag[j] ≥ prop-tag[i] in any state where both are not ⊥.


Proof: Let s be any state where query-tag[j] and prop-tag[i] are not ⊥, that is, both the propagate
phase of operation i and the query phase of operation j have been completed. Then we have
s.prop-end[i] = true and s.query-end[j] = true. Clearly we also have s.query-begin[j] = true. Let
(s', π, s1) be the step when prop-end[i] is set to true, let (s'', π, s2) be the step when query-begin[j]
is set to true and let (s''', π, s3) be the step when query-end[j] is set to true. We must have that
s1 precedes s'', s2 precedes s''' and s3 precedes s in the execution.

Let t = s.prop-tag[i]. We need to prove that s.query-tag[j] ≥ t.

Let W = s.prop-quorum[i] (this is the write quorum used by the propagate phase of operation
i) and let R = s.query-quorum[j] (this is the read quorum used by the query phase of operation j).

Let c1 = s.prop-conf[i] and c2 = s.query-conf[j]. We have W ∈ c1.wqrms and R ∈ c2.rqrms.

Since s.prop-end[i] = true and s.query-end[j] = true, by Invariant 6.5.2 we have that c1, c2 ∈
created \ dead in state s.


We consider three cases: 


1. c1.id = c2.id

Since c1 = c2 we have that R ∩ W ≠ ∅. Let q ∈ R ∩ W. Process q submits to the condenser
function φ_maxtag of operation j a tag which is greater than or equal to t. By definition of
φ_maxtag we have that s.query-tag[j] ≥ t, which gives the claim.


2. c1.id < c2.id

Let p be any process of R. By the code we have that variable high_p is changed only when
action NEWSTATE_p is executed, and is set to current.id_p. Since p participates in the query phase
of operation j there must be a state ŝ' in between s'' and s3 such that ŝ'.high.id_p =
ŝ'.current.id_p = c2.id. By monotonicity of configuration identifiers we have that s.high.id_p ≥
ŝ'.high.id_p, and since c1.id < c2.id we have that s.high.id_p > c1.id.

Now, let y_p be the summary due to the acknowledgment value sent by p for the query phase
of operation j. Notice that the tag y_p.tag is used by the condenser function φ_maxtag for the
query phase of operation j.

By Invariant 6.5.10, applied to state s using c = c1, we have that y_p.tag ≥ t. By definition of
φ_maxtag we have that query-tag[j] ≥ t.


3. c1.id > c2.id

We show that this cannot happen. There are two possible cases:

(a) ∄x ∈ s''.TotEst such that c2.id < x.id < c1.id.

By Invariant 6.3.1 applied to state s'', there exists W' ∈ c2.wqrms such that W' ⊆ c1.set.
We have that R ∩ W' ≠ ∅. Let p ∈ R ∩ W'. We have that p ∈ c1.set.

In state s1 the propagate phase of operation i ends. It must be the case that every
member of c1 has s1.current.id ≥ c1.id and also that every member of c1 has submitted
its state for c1 prior to the beginning of the query phase of operation j. Since p ∈ c1.set,
there must exist a state ŝ preceding s1 such that ŝ.current_p = c1.

Since p ∈ R there must exist a state ŝ' in between s'' and s3 such that ŝ'.current_p = c2.
Since s1 precedes s'', we have that ŝ precedes ŝ'. By monotonicity of configuration
identifiers we must have ŝ.current.id_p ≤ ŝ'.current.id_p, that is c1.id ≤ c2.id. This contradicts
the hypothesis that c1.id > c2.id.


(b) ∃x ∈ s''.TotEst such that c2.id < x.id < c1.id.

Let c' be the totally established configuration with the smallest identifier intervening
between c2 and c1 in s''. By definition of c' we have that c'.id > c2.id.
By Invariant 6.3.1 applied to state s'', there exists W' ∈ c2.wqrms such that W' ⊆ c'.set.
We have that R ∩ W' ≠ ∅. Let p ∈ R ∩ W'. Since c' ∈ s''.TotEst and p ∈ c'.set we have
that s''.current.id_p ≥ c'.id.
Since p ∈ R, there must exist a state ŝ between s'' and s3 such that ŝ.current_p = c2.
By monotonicity of configuration identifiers, since s'' precedes ŝ we have that ŝ.current.id_p ≥
s''.current.id_p. This implies that c2.id ≥ c'.id > c2.id, which is impossible.
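Case 1 of the proof above rests on two facts: a read quorum and a write quorum of the same configuration intersect, and the query phase condenses the acknowledged tags by taking their maximum. The following Python sketch illustrates this under stated assumptions: tags are modelled as (sequence number, process identifier) pairs, and the names phi_maxtag and restrict are hypothetical stand-ins for the condenser function and for the restriction f|Q; they are not the thesis code.

```python
from typing import Dict, Optional, Set, Tuple

# Assumption: a tag is a (sequence number, process id) pair; tuples compare lexicographically.
Tag = Tuple[int, str]


def restrict(acks: Dict[str, Optional[Tag]], quorum: Set[str]) -> Dict[str, Optional[Tag]]:
    """f|Q: keep only the acknowledgments of the processes in the quorum."""
    return {p: acks[p] for p in quorum}


def phi_maxtag(acks: Dict[str, Optional[Tag]]) -> Tag:
    """Hypothetical max-tag condenser: the largest tag among the non-missing acks."""
    return max(tag for tag in acks.values() if tag is not None)


# Hypothetical scenario: the propagate phase wrote tag t to the write quorum W,
# and a later query phase collects acknowledgments from the read quorum R.
W = {"p1", "p2", "p3"}
R = {"p3", "p4"}
t: Tag = (5, "p1")
acks: Dict[str, Optional[Tag]] = {"p1": t, "p2": t, "p3": t, "p4": (2, "p2")}

# p3 lies in both quorums, so the maximum over R is at least t.
query_tag = phi_maxtag(restrict(acks, R))
```

Because p3 belongs to both quorums, its acknowledgment alone forces the condensed tag to be at least t, whatever the other members of R report.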


In order to prove that the system implements an atomic object we use the following lemma from
[65] (Lemma 13.16, page 435).
Lemma 6.5.12 Let β be a (finite or infinite) sequence of actions of a read/write atomic object
external interface. Suppose that β is well-formed, and contains no incomplete operations. Let Π be
the set of all operations in β.

Suppose that ≺ is an irreflexive partial ordering of all the operations in Π, satisfying the following
properties:

1. For any operation i ∈ Π, there are only finitely many operations j such that j ≺ i.

2. If the response event for operation i precedes the invocation event for operation j in β, then it
cannot be the case that j ≺ i.

3. If i is a write operation in Π and j is any operation in Π, then either i ≺ j or j ≺ i.

4. The value returned by each read operation is the value written by the last preceding write
operation according to ≺ (or a fixed initial value, if there is no such write).

Then β satisfies the atomicity property.


We can use the above lemma to prove the following result. By Lemma 13.10 of [65] (page 419)
we can restrict our attention to executions with no incomplete operations.

Theorem 6.5.13 ABD-SYS implements an atomic read/write object.


Proof: In order to show that the system implements an atomic object we need to provide a partial
order that satisfies Lemma 6.5.12. Let us define the order ≺ as follows. First define the tag of an
operation i as tag(i) = prop-tag[i], that is, the tag written in the propagate phase. Order all write
operations in order of tag and place each read operation after the write operation with the same tag
and before any other write operation (the order of read operations in between two consecutive write
operations is irrelevant). Place all reads for which there is no write operation with the same tag
before the first write operation.

Next we prove that ≺ satisfies Lemma 6.5.12. Let us start with Point 1. Any operation j ≺ i must
have tag(j) ≤ tag(i). The number of write operations that precede i is bounded by the number of
tags which are strictly smaller than tag(i). This is a finite number. The number of reads which have
a tag smaller than tag(i) is bounded by the number of read operations completed before operation
i is completed. This is also a finite number.

Now consider Point 2. Assume that the response event for an operation i precedes the invocation
event for an operation j. Then we have that the propagate phase of operation i precedes the query
phase of operation j and by Lemma 6.5.11 we have that the tag returned by the query phase
of operation j is greater than or equal to the tag written by the propagate phase of operation i.
Since the latter is equal to tag(i) and since the former is less than or equal to tag(j) we have that
tag(i) ≤ tag(j). Thus it cannot be that j ≺ i because this means that tag(j) < tag(i).
Consider now Point 3. Assume that i is a write operation and that j is any other operation.
Assume by contradiction that neither i ≺ j nor j ≺ i. Then we have that tag(i) = tag(j) and
that j is also a write operation (a read operation with the same tag as i is such that i ≺ j). Since
tag(i) = tag(j) and since the process identifier is part of the tag, it must be the case that both
operations are requested by the same process. Hence it must be the case that one of the operations,
say i, is completed before the other, operation j, is requested. Thus the (completed) propagate
phase of operation i precedes the query phase of operation j. Hence by Lemma 6.5.11 we have
that the tag returned by the query phase of operation j is greater than or equal to the tag written
by the propagate phase of operation i. The latter is equal to tag(i) and the former, by the code,
is strictly less than tag(j). Hence we have that tag(i) < tag(j), which contradicts the fact that
tag(i) = tag(j).

Finally consider Point 4. Since each read is ordered right after the write with the same tag it
is enough to show that a read operation i gets the value written by a write operation j such that
tag(j) = tag(i). So let i be a read operation and let j be a write operation with tag(j) = tag(i). It
follows by the code that the value returned by operation i is the one written by operation j (because
tag and value are updated simultaneously). □
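The ordering used in the proof can be illustrated with a small Python sketch. It produces one total order compatible with the partial order defined above: writes sorted by tag, each read placed right after the write with the same tag, and reads with no matching write placed first. The Op record, the linearize function, and the (sequence number, process identifier) encoding of tags are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: a tag is a (sequence number, process id) pair, compared lexicographically.
Tag = Tuple[int, str]


@dataclass(frozen=True)
class Op:
    name: str
    kind: str  # "read" or "write"
    tag: Tag   # tag(i) = prop-tag[i]


def linearize(ops: List[Op]) -> List[Op]:
    """Return one total order of the operations that respects the tag-based order."""
    writes = sorted((o for o in ops if o.kind == "write"), key=lambda o: o.tag)
    write_tags = {w.tag for w in writes}
    # Reads whose tag matches no write go before the first write.
    order = [o for o in ops if o.kind == "read" and o.tag not in write_tags]
    for w in writes:
        order.append(w)
        # Reads with the same tag follow their write; their mutual order is irrelevant.
        order.extend(o for o in ops if o.kind == "read" and o.tag == w.tag)
    return order


ops = [
    Op("r2", "read", (1, "b")),
    Op("w2", "write", (1, "b")),
    Op("r0", "read", (0, "")),   # hypothetical initial tag: no write carries it
    Op("r1", "read", (1, "a")),
    Op("w1", "write", (1, "a")),
]
names = [o.name for o in linearize(ops)]
```

In this hypothetical history the two writes have the same sequence number and are separated only by the process identifier, which is why the identifier must be part of the tag.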


6.6 Remarks 


We remark that the intersection property of DC, namely that there exist a read quorum R and a write
quorum W of a previous primary configuration both belonging to the next primary configuration,
comes from the particular application that we have developed. For other applications one might 
have different (maybe weaker) intersection properties. For example, one might require that the 
new primary configuration contains a read quorum of the previous one (and not necessarily a write 
quorum). In our case, we must require both a read quorum and a write quorum in the new primary. 
If we do not require a read quorum to be in the new configuration but only require a write quorum 
to be in the new configuration, since write quorums may not intersect, two non-intersecting write 
quorums might concurrently proceed to two primary configurations, violating the uniqueness of a 
primary configuration. The same situation can happen if we do not require a write quorum to 
be in the new configuration but only a read quorum to be in the new configuration, because two 
read quorums may not intersect. In this latter case it is also possible for a read quorum in an old 
configuration to read obsolete values; indeed processes in a read quorum can be left behind if newer 
configurations are established but since they form a read quorum of the configuration they are in, 
they will be able to read whatever (obsolete) value they have. 
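The argument above relies on the fact that, in a read-write quorum system, every read quorum intersects every write quorum, while two write quorums (or two read quorums) may be disjoint. A minimal Python sketch of this situation follows; the 2×2 grid of processes is a hypothetical example chosen for illustration, with rows as read quorums and columns as write quorums.

```python
from itertools import product
from typing import FrozenSet, List

Quorum = FrozenSet[str]


def rw_intersect(read_qs: List[Quorum], write_qs: List[Quorum]) -> bool:
    """True if every read quorum shares a member with every write quorum."""
    return all(r & w for r, w in product(read_qs, write_qs))


def pairwise_intersect(qs: List[Quorum]) -> bool:
    """True if any two quorums of the same kind share a member."""
    return all(a & b for a, b in product(qs, qs))


# Hypothetical grid configuration over processes a, b, c, d:
#   a b
#   c d
read_qs = [frozenset({"a", "b"}), frozenset({"c", "d"})]    # rows
write_qs = [frozenset({"a", "c"}), frozenset({"b", "d"})]   # columns

# Two candidate successors, each containing only a write quorum of the grid.
next_1 = frozenset({"a", "c", "x"})
next_2 = frozenset({"b", "d", "y"})
disjoint_successors = not (next_1 & next_2)
```

Both candidate successors contain a write quorum of the grid configuration, yet they share no member, so requiring only a write quorum would not prevent them from proceeding concurrently.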

It is possible to optimize the state transfer at the beginning of a new configuration. The goal of
the state transfer is to obtain all information from previous configurations. Clearly processes that join
the system have no information about previous configurations. Hence it is useless to wait for them
to submit their state before computing the new up-to-date state.

We remark that the choice of integrating the state transfer into the service has been made because 
most applications have to perform such state exchange and thus it seems reasonable to do it within 
the service in order to free the application from the details of such a computation. We did not 
change the DVS service to also offer integrated state exchange because some applications may not 
require submission of the current state from every member of a new view or configuration. So it 
may be useful also to leave to the application control of the state exchange computation. 

The above remark is connected to the question: does DC supersede DVS? On one hand DC is
more general than DVS because it provides a group communication service that handles configurations
instead of views and configurations carry more information than views. On the other hand there are
some differences between DC and DVS. We already talked about the difference in the state exchange
mechanism. Another difference is in the communication mechanism used by the two services: DVS
uses a point-to-point communication mechanism, while DC uses a broadcast/convergecast mechanism
involving a quorum of processes. Because of these differences we have kept the two services as
different services.

The DC service requires every process of a new configuration to submit its state. This is a strong 
requirement for applications that use quorums to improve availability. However it provides a strong 
service. It is possible to specify a weaker version of the DC service that requires only a read quorum 
to submit the state before computing the starting state of a new configuration. We believe that the 
TO algorithm we have developed in this chapter would still work with this weaker service. 

The ABD algorithm does not use the prefix property of message delivery guaranteed by the CS
specification. Hence one could use a weaker specification instead of CS to implement DC.

The implementation of DC performs garbage collection when a view becomes totally established 
(any previous view is discarded and no intersection checks are made with these discarded views). 
Yeger Lotem et al. [89] perform garbage collection when a view becomes established (the process
that establishes a view discards all previous ambiguous views). Our garbage collection mechanism,
though less efficient than that of [89], allows an easier proof of correctness.
Chapter 7


Dynamic Algorithms 


In this chapter we apply the ideas about dynamic configurations developed in Chapter 6 to design
a dynamic version of the PAXOS algorithm [61], called DPAXOS, and a dynamic primary copy data
replication algorithm implementing an atomic object, called RAB.

Both algorithms are built upon an underlying group communication service; this service, called
DLC, is a variation of the DC service (see Chapter 6) which uses “leader configurations” (see
Chapter 3) and augments the service with point-to-point communication.

We start the chapter with the DLC service. Section 7.2 provides the DPAXOS algorithm. In
Section 7.3 we sketch the RAB algorithm. Finally Section 7.4 contains concluding remarks.


7.1 The DLC specification 


In this section we provide a dynamic primary configuration group communication service. This
service, which we call DLC, is similar to the DC service; the differences are: (i) DLC handles leader
configurations instead of read-write quorum configurations (see Chapter 3) and (ii) DLC provides
point-to-point channels too.

The code is provided in Figure 7-1. In Section 7.1.1 we explain the differences between DLC
and DC. We also provide a full description of DLC in Section 7.1.2; however the reader who comes
from Chapter 6 and reads Section 7.1.1 can safely skip Section 7.1.2.


7.1.1 Differences with DC


There are basically two differences between DC and DLC. The first difference is due to the kind of
configurations considered: leader configurations instead of read-write quorum configurations. As
a result, the key intersection property becomes the following: for any two created and non dead
configurations c1 and c2, with c1.id < c2.id, either there exists an intervening totally established
configuration or a quorum of c1 is included in the membership set of c2. Invariant 7.1.1 formalizes
the above key property (this invariant is given in Section 7.1.3).

DLC

Signature:

  Input:    SUBMIT(m, φ, i)_p, m ∈ M, φ ∈ Φ, p ∈ P, i ∈ OID_p
            ACKDLVR(a, i)_p, a ∈ A, i ∈ OID, p ∈ P
            SUBMIT-STATE(s, φ)_p, s ∈ S, φ ∈ Φ, p ∈ P
            P2P-SEND(m)_{p,q}, m ∈ M, p, q ∈ P

  Output:   NEWCONF(c)_p, c ∈ C, p ∈ c.set
            NEWSTATE(s)_p, s ∈ S, p ∈ P
            RESPOND(a, i)_p, a ∈ A, i ∈ OID_p, p ∈ P
            DELIVER(m, i)_p, m ∈ M, i ∈ OID, p ∈ P
            P2P-RECV(m)_{p,q}, m ∈ M, p, q ∈ P

  Internal: CREATECONF(c), c ∈ C

State:

  created ∈ 2^C, init {c0}
  for each p ∈ P:
    cur-cid[p] ∈ G_⊥, init g0 if p ∈ P0, ⊥ else
  for each p, q ∈ P, g ∈ G:
    p2p-msgs[p, g, q] ∈ seqof(M), initially empty
    p2p-next[p, g, q] ∈ N^{>0}, initially 1
  for each g ∈ G:
    got-state[g] : P → S_⊥, init everywhere ⊥
    condenser[g] : P → Φ_⊥, init everywhere ⊥
    state-dlv[g] ∈ 2^P, init P0 if g = g0, {} else
    pending[g] ∈ O, init everywhere ⊥
    attempted[g] ∈ 2^P, init P0 if g = g0, {} else

Derived variables:

  Att ∈ 2^C, defined as {c ∈ created | attempted[c.id] ≠ ∅}
  Est ∈ 2^C, defined as {c ∈ created | state-dlv[c.id] ≠ ∅}
  TotAtt ∈ 2^C, defined as {c ∈ created | c.set ⊆ attempted[c.id]}
  TotEst ∈ 2^C, defined as {c ∈ created | c.set ⊆ state-dlv[c.id]}
  dead ∈ 2^C, defined as dead = {c ∈ C | ∃p ∈ c.set : cur-cid[p] > c.id and p ∉ attempted[c.id]}

Actions:

  internal CREATECONF(c)
    Pre: ∀w ∈ created : c.id ≠ w.id
         if c ∉ dead then
           ∀w ∈ created, w.id < c.id:
             w ∈ dead or
             (∃x ∈ TotEst : w.id < x.id < c.id) or
             (∃Q ∈ w.qrms : Q ⊆ c.set)
           ∀w ∈ created, w.id > c.id:
             w ∈ dead or
             (∃x ∈ TotEst : c.id < x.id < w.id) or
             (∃Q ∈ c.qrms : Q ⊆ w.set)
    Eff: created := created ∪ {c}

  output NEWCONF(c)_p, p ∈ c.set
    Pre: c ∈ created
         c.id > cur-cid[p]
    Eff: cur-cid[p] := c.id
         attempted[c.id] := attempted[c.id] ∪ {p}

  input SUBMIT-STATE(s, φ)_p
    Eff: if cur-cid[p] ≠ ⊥ and
            got-state[cur-cid[p]](p) = ⊥ then
           got-state[cur-cid[p]](p) := s
           condenser[cur-cid[p]](p) := φ

  output NEWSTATE(s)_p, choose c
    Pre: c.id = cur-cid[p]
         c ∈ created
         ∀q ∈ c.set, got-state[c.id](q) ≠ ⊥
         let f = condenser[c.id](p)|c.set
         s = f(got-state[c.id])
         p ∉ state-dlv[c.id]
    Eff: state-dlv[c.id] := state-dlv[c.id] ∪ {p}

  input SUBMIT(m, φ, i)_p
    Eff: if cur-cid[p] ≠ ⊥ then
           pending[cur-cid[p]](i) := (m, φ, ∅, λ(x) : x → ⊥, false)

  output DELIVER(m, i)_p, choose g
    Pre: g = cur-cid[p]
         p ∉ pending[g](i).dlv
         pending[g](i).msg = m
    Eff: pending[g](i).dlv := pending[g](i).dlv ∪ {p}

  input ACKDLVR(a, i)_p
    Eff: if cur-cid[p] ≠ ⊥ and
            pending[cur-cid[p]](i).ack(p) = ⊥ then
           pending[cur-cid[p]](i).ack(p) := a

  output RESPOND(r, i)_p, choose c, Q
    Pre: c.id = cur-cid[p]
         c ∈ created
         i ∈ OID_p
         pending[c.id](i).rsp = false
         Q ∈ c.qrms
         let f = pending[c.id](i).ack
         ∀q ∈ Q : f(q) ≠ ⊥
         r = pending[c.id](i).cnd(f|Q)
    Eff: pending[c.id](i).rsp := true

  input P2P-SEND(m)_{p,q}
    Eff: if cur-cid[p] ≠ ⊥ then
           append m to p2p-msgs[p, cur-cid[p], q]

  output P2P-RECV(m)_{p,q}, choose g
    Pre: g = cur-cid[q]
         m = p2p-msgs[p, g, q](p2p-next[p, g, q])
    Eff: p2p-next[p, g, q] := p2p-next[p, g, q] + 1

Figure 7-1: The DLC specification

The second difference is that DLC offers also point-to-point communication. That is, a process p
can send a message m to another process q, provided that both p and q are in the same configuration.
Actions P2P-SEND(m)_{p,q} and P2P-RECV(m)_{p,q} of DLC implement the point-to-point communication
mechanism (this portion of the code is not present in DC).


The rest of the DLC specification is the same as the DC specification. 


7.1.2 Full description of DLC 


In this section we provide a full description of DLC. The reader who comes from Chapter 6 and 
has read Section 7.1.1 can safely skip this section. The description we provide here is similar to the 
description of the DC service provided in Section 6.2. 

Prior to providing the code for the DLC specification, we need some notation and definitions, which
we introduce in the following.

Let OID be a set of operation identifiers, partitioned into sets OID_p, p ∈ P. We denote by
M_c ⊆ M the set of messages that clients may use for communication.

Let A be a set of “acknowledgment” values and let R be a set of “response” values. A condenser
function is a function from (P → A) to R. Let Φ be the set of all condenser functions. Let S
be the set of all possible states of the clients (a state of S does not need to be the entire client's
state, but it may contain only the relevant information in order for the application to work). The
DLC specification uses a condenser function also to compute the starting state of a new configuration;
hence we assume that S ⊆ A and also S ⊆ R. Given a function f : P → D from the set of processes
P to some domain D and given a subset Q ⊆ P, we write f|Q to denote the function f' : Q → D,
defined as f'(p) = f(p) for p ∈ Q.

The following data type is used to describe operations: D = M × Φ × 2^P × (P → A_⊥) × Bool
and we let O = OID → D_⊥. Given an operation descriptor, selectors for the components are msg,
cnd, dlv, ack, and rsp.
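The notation above can be made concrete with a short Python sketch. It models an operation descriptor with the five selectors msg, cnd, dlv, ack and rsp, the restriction f|Q, and a condenser function applied to the acknowledgments of a quorum. The class and function names, the use of None for ⊥, and the integer acknowledgment values are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Optional, Set

Ack = Optional[int]                           # None plays the role of ⊥
Condenser = Callable[[Dict[str, Ack]], int]   # a function from (P → A) to R


def restrict(f: Dict[str, Ack], Q: Iterable[str]) -> Dict[str, Ack]:
    """f|Q: the function f restricted to the processes in Q."""
    return {p: f[p] for p in Q}


@dataclass
class Descriptor:
    msg: str                                  # message describing the request
    cnd: Condenser                            # condenser used to compute the response
    dlv: Set[str] = field(default_factory=set)          # processes that received msg
    ack: Dict[str, Ack] = field(default_factory=dict)   # acknowledgment values
    rsp: bool = False                         # has the requester received a response?


def new_descriptor(msg: str, cnd: Condenser, processes: Iterable[str]) -> Descriptor:
    """Default descriptor: nothing delivered, every acknowledgment is ⊥, no response."""
    return Descriptor(msg, cnd, set(), {p: None for p in processes}, False)


# Hypothetical usage: the condenser returns the largest acknowledged value.
d = new_descriptor("read", lambda acks: max(acks.values()), ["p", "q", "r"])
d.ack["p"] = 3
d.ack["q"] = 7
response = d.cnd(restrict(d.ack, {"p", "q"}))
```

Restricting to the quorum {p, q} before condensing matters here: process r has not acknowledged, and its ⊥ entry must not reach the condenser.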

Next we provide remarks and an informal description of this code. We start with the derived 
variables. 

A configuration c ∈ Att is said to be attempted. For an attempted configuration c there exists at
least one process p that has executed action NEWCONF(c)_p and thus we have that p ∈ attempted[c.id];
when this holds we say that c is attempted at p or that p has attempted c. A configuration c ∈ TotAtt
is said to be totally attempted. A totally attempted configuration is a configuration that is attempted
at all members of the configuration.

A configuration c ∈ Est is said to be established. For an established configuration c there exists at
least one process p that has executed action NEWSTATE(s)_p and thus we have that p ∈ state-dlv[c.id];
when this holds we say that c is established at p or that p has established c. A configuration
c ∈ TotEst is said to be totally established. A totally established configuration is a configuration that
is established at all members of the configuration.

A dead configuration c is a configuration for which a member process p went on to newer
configurations, that is, it executed action NEWCONF(c')_p with c'.id > c.id, before receiving the
notification, that is the NEWCONF(c)_p event, for configuration c.
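The derived variables described above can be sketched as plain set comprehensions. In the following Python sketch the Conf record, the field name members (standing for c.set) and the dictionary encodings of attempted, state-dlv and cur-cid are assumptions made for illustration; None plays the role of ⊥.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional, Set


@dataclass(frozen=True)
class Conf:
    id: int
    members: FrozenSet[str]   # c.set in the specification


def partially(created: Set[Conf], table: Dict[int, Set[str]]) -> Set[Conf]:
    """Att (with table = attempted) or Est (with table = state-dlv)."""
    return {c for c in created if table.get(c.id)}


def totally(created: Set[Conf], table: Dict[int, Set[str]]) -> Set[Conf]:
    """TotAtt (with table = attempted) or TotEst (with table = state-dlv)."""
    return {c for c in created if c.members <= table.get(c.id, set())}


def dead(confs: Set[Conf], cur_cid: Dict[str, Optional[int]],
         attempted: Dict[int, Set[str]]) -> Set[Conf]:
    """Configurations with a member that moved past c.id without attempting c."""
    return {
        c for c in confs
        if any(
            cur_cid.get(p) is not None
            and cur_cid[p] > c.id
            and p not in attempted.get(c.id, set())
            for p in c.members
        )
    }


# Hypothetical state: b skipped configuration 1 and went directly to configuration 2.
c1 = Conf(1, frozenset({"a", "b"}))
c2 = Conf(2, frozenset({"a", "b", "c"}))
created = {c1, c2}
attempted = {1: {"a"}, 2: {"a", "b", "c"}}
cur_cid = {"a": 2, "b": 2, "c": 2}
```

In this state c1 is attempted (at a) but dead, because b can no longer receive it, while c2 is totally attempted.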

Now we comment on the transitions. 

Action CREATECONF(c) creates a new configuration c. The first precondition simply requires this
new configuration to have a brand new identifier. The second precondition of this action is the key
to our specification. It states that when a configuration c is created it must either be already dead or,
for any other configuration w such that there are no intervening totally established configurations,
the earlier configuration (i.e., the one with smaller identifier) has at least one quorum included in
the membership set of the later configuration (i.e., the one with bigger identifier).
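The precondition just described can be written as a predicate. The following Python sketch is one possible reading, under stated assumptions: the Conf record, the field name members (for c.set), and the representation of dead and TotEst as sets of identifiers are hypothetical choices, not the thesis code.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set, Tuple


@dataclass(frozen=True)
class Conf:
    id: int
    members: FrozenSet[str]               # c.set
    qrms: Tuple[FrozenSet[str], ...]      # quorums of the configuration


def can_create(c: Conf, created: Set[Conf], dead_ids: Set[int],
               tot_est_ids: Set[int]) -> bool:
    """Check both preconditions of CREATECONF(c)."""
    if any(w.id == c.id for w in created):
        return False                      # the identifier must be brand new
    if c.id in dead_ids:
        return True                       # an already dead configuration is unconstrained
    for w in created:
        if w.id in dead_ids:
            continue
        earlier, later = (w, c) if w.id < c.id else (c, w)
        if any(earlier.id < x < later.id for x in tot_est_ids):
            continue                      # an intervening totally established configuration
        if any(Q <= later.members for Q in earlier.qrms):
            continue                      # a quorum of the earlier one is in the later one
        return False
    return True


def majorities(a: str, b: str, c: str) -> Tuple[FrozenSet[str], ...]:
    """Hypothetical quorum system: all pairs out of three members."""
    return (frozenset({a, b}), frozenset({a, c}), frozenset({b, c}))


c0 = Conf(0, frozenset("abc"), majorities("a", "b", "c"))
good = Conf(1, frozenset("abd"), majorities("a", "b", "d"))
bad = Conf(1, frozenset("cde"), majorities("c", "d", "e"))
far = Conf(2, frozenset("bde"), majorities("b", "d", "e"))
```

Here good may be created because it contains the quorum {a, b} of c0, while bad may not unless it is already dead. Once good is totally established, far need not intersect c0 at all; it only needs a quorum of good.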

Action NEWCONF(c)_p delivers a created configuration c to the client process p. The precondition
of this action makes sure that configurations are delivered in order of configuration identifiers. We
notice that because of this precondition, when a configuration c is dead because a process q went on
to newer configurations, we have that process q can no longer execute action NEWCONF(c)_q.

Once a configuration c has been delivered to a client process p, the client process p is supposed
to submit its current state s and a condenser function φ, by means of action SUBMIT-STATE(s, φ)_p. Once
all the processes have submitted their current states, the condenser function φ is used to compute
the starting state of configuration c for process p. The code of this action just memorizes the state
s and the condenser function φ for the current configuration of process p.

Action NEWSTATE(s)_p computes the starting state for a configuration c. The precondition of this
action requires that all processes q in the membership of configuration c have submitted their state
for configuration c. The starting state s of configuration c for process p is then computed by applying
the condenser function that process p has submitted to the service with action SUBMIT-STATE(s, φ)_p.
Variable state-dlv[c.id] records the fact that p has received the starting state for configuration c.

We remark that for a dead configuration c there is at least one process p that does not execute
action NEWCONF(c)_p and thus does not submit its state for c with action SUBMIT-STATE(s, φ)_p. This implies
that action NEWSTATE(s)_q cannot be executed for any process q. This is why such configurations are
called “dead”.

The remaining actions are used to handle the requests of clients. We refer to the process of
handling such a request, which involves the participation of a quorum of processes, as an “operation”.
To request the execution of an operation a client process p uses action SUBMIT(m, φ, i)_p. The parameters
of this action are as follows: m is a message describing the operation that p needs to perform; φ is
a condenser function to be used to compute a response value for p when a quorum of processes have
provided acknowledgment values to p's message m; i is an operation identifier needed to distinguish
operations (every requested operation has a unique operation identifier). We say “operation i” to
indicate the operation requested with action SUBMIT(m, φ, i)_p. For configuration c and operation i,
the variable pending[c.id](i) contains an operation descriptor. The code of action SUBMIT(m, φ, i)_p sets
this operation descriptor to a default value.

We now provide an explanation for each component of an operation descriptor. Let d be an
operation descriptor for operation i requested by p in configuration c. d.msg is the message that
describes the request of p; such a message will be delivered to all members of the configuration c.
d.cnd is the condenser function that will be used to compute the response for the operation once
a quorum of processes has provided acknowledgment values. d.dlv is the set of processes to which
the message d.msg has been delivered; initially this is set to an empty set by action SUBMIT(m, φ, i)_p.
d.ack contains the acknowledgment values received; initially this is a vector of ⊥ values. Finally,
d.rsp is a flag indicating whether or not the client p that requested the operation has received a
response for the operation.

Action DELIVER(m, i)_p delivers the message m of operation i to process p. The code of this action
updates the operation descriptor d for operation i by adding process p to the set d.dlv.

Processes that receive the message m for an operation i are supposed to provide an acknowledgment
value a with action ACKDLVR(a, i)_p. The code of this action records the acknowledgment value a
of process p into the vector d.ack, where d is the operation descriptor for operation i.

Action RESPOND(r, i)_p provides a response r to process p for the operation i previously submitted by
p. The precondition of this action requires that a quorum Q has provided acknowledgment values.
Then the value r is computed by applying the condenser function provided by p at the time of the
submission to the acknowledgment values of processes in Q. At this point the operation has been
serviced and the rsp component is set to true.

The code that handles point-to-point communication is provided by actions P2P-SEND(m)_{p,q} and
P2P-RECV(m)_{p,q}. State variable p2p-msgs[p, g, q] is used to record the messages sent. When action
P2P-SEND(m)_{p,q} is executed in a configuration whose identifier is g, message m is added to
p2p-msgs[p, g, q], which contains a queue of messages sent by p to q in configuration g. Action
P2P-RECV(m)_{p,q} delivers message m to q.
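The point-to-point mechanism keeps one queue per (sender, configuration identifier, receiver) triple, together with a 1-based index of the next message to deliver. A minimal Python sketch follows; the class P2PChannels and its method names are hypothetical, and None plays the role of ⊥.

```python
from collections import defaultdict
from typing import Dict, Optional


class P2PChannels:
    """Illustrative model of p2p-msgs and p2p-next; not the thesis code."""

    def __init__(self) -> None:
        self.msgs = defaultdict(list)        # (p, g, q) -> messages sent by p to q in g
        self.next = defaultdict(lambda: 1)   # (p, g, q) -> 1-based index of next message
        self.cur_cid: Dict[str, Optional[int]] = {}

    def send(self, p: str, q: str, m: str) -> None:
        """P2P-SEND(m)_{p,q}: append m to the queue of p's current configuration."""
        g = self.cur_cid.get(p)
        if g is not None:
            self.msgs[(p, g, q)].append(m)

    def recv(self, p: str, q: str) -> Optional[str]:
        """P2P-RECV(m)_{p,q}: deliver the next message in q's current configuration."""
        g = self.cur_cid.get(q)
        if g is None:
            return None
        key = (p, g, q)
        index = self.next[key]
        if index > len(self.msgs[key]):
            return None                      # precondition not enabled: nothing to deliver
        self.next[key] = index + 1
        return self.msgs[key][index - 1]


ch = P2PChannels()
ch.cur_cid = {"p": 1, "q": 1}
ch.send("p", "q", "m1")
ch.send("p", "q", "m2")
first = ch.recv("p", "q")        # delivered in configuration 1
ch.cur_cid["q"] = 2              # q moves on to configuration 2
stale = ch.recv("p", "q")        # m2 was sent in configuration 1 and is not delivered
ch.cur_cid["p"] = 2
ch.send("p", "q", "m3")
fresh = ch.recv("p", "q")        # both processes are now in configuration 2
```

Indexing the queues by configuration identifier is what restricts delivery to messages sent in the receiver's current configuration.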


7.1.3 Invariant of DLC


In this section we provide a key invariant of DLC. The property stated by this invariant is used to
prove correct the applications that we build on top of DLC.

The second precondition of CREATECONF(c) is the key to our specification. It states that when a
configuration c is created it must either be already dead or, for any other configuration w such that
there are no intervening totally established configurations, the earlier configuration (i.e., the one
with smaller identifier) has one quorum whose members are included in the membership set of the
later configuration (i.e., the one with bigger identifier). The above precondition enables us to prove
the following key invariant:


Invariant 7.1.1 In any reachable state of DLC, the following is true. Let c1, c2 ∈ created \ dead,
with c1.id < c2.id. Then either there exists w ∈ TotEst, c1.id < w.id < c2.id, or else there exists a
quorum Q ∈ c1.qrms such that Q ⊆ c2.set.


Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. In the initial state the invariant is vacuously true because there
are no two configurations c1, c2 ∈ created such that c1.id < c2.id.

For the inductive step assume that the invariant is true in a state s. We need to prove that
the invariant is true in s' for any step (s, π, s'). The only action that we need to worry about is
CREATECONF(c), where c = c1 or c = c2, because it creates a new configuration (otherwise the invariant
is true in s' by the inductive hypothesis). So assume that π = CREATECONF(c). The invariant follows
easily from the precondition of π. □


7.2 Dynamic PAXOS algorithm


In this section we present DPAXOS, an algorithm that solves the consensus problem and that adapts
well to dynamic distributed systems. The DPAXOS algorithm is derived from Lamport's PAXOS
algorithm and uses the same strategy. We refer the reader to the paper by Lamport [61] for a
complete and colorful presentation of the PAXOS algorithm; the papers [29, 30] provide a less colorful
description of the PAXOS algorithm using the I/O automaton model. For the sake of completeness,
in this section we provide a brief description of the PAXOS algorithm.

We first provide a formal definition of the consensus problem in Section 7.2.1. Then in
Section 7.2.2 we recall the original PAXOS algorithm. Section 7.2.3 provides the DPAXOS algorithm and
Section 7.2.4 the proof of correctness.


7.2.1 Distributed Consensus 


Distributed consensus is a fundamental problem in distributed computation. Roughly speaking 
the problem is the one of reaching agreement among the members of a distributed system; such 
agreement is often necessary for distributed applications (e.g., data replication, airline reservation 
system, distributed transactions). 

Next we give a formal definition of the consensus problem that we consider in this paper. Process
p starts the computation with an input value $v_p \in \mathcal{X}$, where $\mathcal{X}$ is the set of all possible initial values.
Given a particular execution $\alpha$, we denote by $\mathcal{X}_\alpha \subseteq \mathcal{X}$ the set of the initial values of processes in $\alpha$.

Each process has a state variable called decision which is a write-once variable initially not
written. Processes have to write their decision variable in such a way that three conditions are
satisfied:

• Agreement: All the written decision variables are set to the same value.
• Validity: Any written decision variable is set to a value in $\mathcal{X}_\alpha$.
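The two safety conditions can be phrased as simple predicates over the final state of an execution.
The following is a minimal sketch, assuming that decisions and initial values are given as dictionaries
indexed by process and that None stands for a decision variable that has not been written; this
encoding is an assumption made for the example.

```python
def agreement(decisions):
    """All written decision variables hold the same value (None = not written)."""
    written = {v for v in decisions.values() if v is not None}
    return len(written) <= 1


def validity(decisions, initial_values):
    """Every written decision is one of the initial values of the execution."""
    inputs = set(initial_values.values())
    return all(v in inputs for v in decisions.values() if v is not None)
```

A process that never decides does not violate either condition, which matches the fact that
termination is treated separately.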


Since the third condition, termination, is a liveness issue and we do not address liveness, we do not
provide a formal definition. Informally, the termination condition requires that if in a given state s a
configuration c is totally established, c is the configuration with the biggest identifier in s.created,
no other configurations are created after s, all processes of c are alive in state s, and no failures
happen after s, then all processes that are members of c eventually write the decision variable,
provided that the execution is a fair execution.


7.2.2 The original PAXOS algorithm 


The PAXOS algorithm does not use a group communication service, so there are no views or configu- 
rations; it relies on an external leader elector module which provides (unreliable) information about 
the current status of the underlying distributed system, i.e., tells the current membership and the
leader, to each process. Termination is guaranteed only when there are no failures, and the external 
leader elector provides reliable information on the system, for a sufficiently long time. 

The basic idea is to have processes propose values until one of them is accepted by a majority of 
the processes; that value is the final output value. Any process may propose a value by initiating a 
round for that value. The process initiating a round is said to be the leader of that round while all 
processes, including the leader itself, are said to be agents for that round. 

Since different rounds may be carried out concurrently (the leader elector is not reliable hence 
several processes may concurrently consider themselves leaders), we need to distinguish them. Every 
round has a unique identifier. Next we formally define these round identifiers. A round number is a 
pair (x,i) where x is a nonnegative integer and i is a process identifier. The set of round numbers 
is denoted by R. A total order on elements of R is defined by (x, i) < (y, j) iff x < y, or x = y and
i < j.

We say that round r precedes round r’ if r < r'. If round r precedes round r’ then we also say 
that r is a previous round, with respect to round r’. We remark that the ordering of rounds is not 
related to the actual time the rounds are conducted. It is possible that a round r’ is started at some
point in time and a previous round r, that is, one with r < r', is started later on. 
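The order on round numbers is the lexicographic order on pairs. A minimal sketch, assuming that
process identifiers are integers (an assumption made only so that they can be compared):

```python
from typing import NamedTuple


class Round(NamedTuple):
    x: int  # nonnegative integer
    i: int  # identifier of the process that starts the round


def precedes(r, r_prime):
    """True when round r precedes round r_prime in the total order."""
    # Named tuples compare field by field: first on x, then on i.
    return r < r_prime
```

Because the order is defined on identifiers only, nothing in it refers to the time at which a round is
actually started.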

Every round in the algorithm is tagged with a unique round number. Every message sent by the
leader or by an agent for a round (with round number) $r \in R$ carries the round number r so that
no confusion among messages belonging to different rounds is possible.

Informally, the steps for a round are the following. 


1. To initiate a round, the leader sends a “Collect” message to all agents announcing that it
wants to start a new round with round number r and at the same time asking for information
about previous rounds in which agents may have been involved.

2. An agent that receives a message sent in step 1 from the leader of the round responds with
a “Last” message giving its own information about rounds previously conducted, namely the
last round r’ for which the agent made a commitment and the value v of that round. With
this, the agent makes a kind of commitment for this particular round that may prevent it from
accepting (in step 4) the value proposed in some other round. If the agent is already committed
for a round with a bigger round number then it informs the leader of its commitment with an
“OldRound” message.

3. Once the leader has gathered information about previous rounds from a majority of agents, it
decides, according to some rules, the value to propose for its round and sends to all agents a
“Begin” message announcing the value v for round r and asking them to accept it. In order
for the leader to be able to choose a value for the round it is necessary that initial values
be provided. If no initial value is provided, the leader must wait for an initial value before
proceeding with step 3. The set of processes from which the leader gathers information is
called the info-quorum of the round.

4. An agent that receives a message from the leader of the round sent in step 3 responds with
an “Accept” message by accepting the value proposed in the current round r, unless it is
committed for a later round and thus must reject the value proposed in the current round. In
the latter case the agent sends an “OldRound” message to the leader indicating the round r’
for which it is committed.

5. If the leader gets “Accept” messages from a majority of agents, then the leader sets its own
output value to the value proposed in the round. At this point the round is successful. The
set of agents that accept the value proposed by the leader is called the accepting-quorum.
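The agent side of steps 2 and 4 can be sketched as follows. This is a simplified illustration under
stated assumptions, not the automaton of [29, 30]: round numbers are plain tuples (x, i), messages are
returned as tuples instead of being sent over a network, the initial commitment (0, 0) is a
placeholder, and the agent is assumed to treat accepting a value in round r as a commitment for r.

```python
class Agent:
    """State kept by one agent across rounds (hypothetical encoding)."""

    def __init__(self):
        self.committed = (0, 0)   # highest round for which a commitment was made
        self.last_round = None    # last round in which a value was accepted
        self.last_value = None    # value accepted in that round

    def on_collect(self, r):
        # Step 2: answer "Last" and commit for r, or report an existing commitment.
        if self.committed > r:
            return ("OldRound", r, self.committed)
        self.committed = r
        return ("Last", r, self.last_round, self.last_value)

    def on_begin(self, r, v):
        # Step 4: accept the value unless committed for a later round.
        if self.committed > r:
            return ("OldRound", r, self.committed)
        self.committed = r  # assumption: accepting also counts as a commitment
        self.last_round, self.last_value = r, v
        return ("Accept", r)
```

An agent that has answered the “Collect” of a later round therefore rejects the “Begin” of any earlier
round, which is the behavior the commitment in step 2 is meant to enforce.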


Since a successful round implies that the leader of the round reaches a decision, after a successful 
round the leader still needs to do something, namely to broadcast the decision. Thus, once the 
leader has made a decision it broadcasts a “Success” message announcing the value for which it has 
decided. An agent that receives a “Success” message from the leader makes its decision choosing 
the value of the successful round. We use also an “Ack” message sent from the agent to the leader,
so that the leader can make sure that everyone knows the outcome.

The most important issue is about the values that leaders propose for their rounds. Indeed,
since the value of a successful round is the output value of some processes, we must guarantee 
that the values of successful rounds are all equal in order to satisfy the agreement condition of 
the consensus problem. This is the tricky part of the algorithm and basically all the difficulties 
derive from solving this problem. Consistency is guaranteed by choosing the values of new rounds 
exploiting the information about previous rounds from at least a majority of the agents so that, for 
any two rounds, there is at least one process that participated in both rounds. 

In more detail, the leader of a round chooses the value for the round in the following way. In
step 1, the leader asks for information and in step 2 an agent responds with the number of the latest
round in which it accepted the value and with the accepted value, or with round number (0, j) and
no value if the agent has not yet accepted a value. Once the leader gets such information from a majority
of the agents (which is the info-quorum of the round), it chooses the value for its round to be equal
to the value of the latest round among all those it has heard from the agents in the info-quorum or
equal to its initial value if none of the agents in the info-quorum was involved in any previous round.
Moreover, in order to keep consistency, if an agent tells the leader of a round r that the last round in
which it accepted a value is round r', r' < r, then implicitly the agent commits itself not to accept
any value proposed in any other round r'', r' < r'' < r.

Given the above setting, if r' is the round from which the leader of round r gets the value for
its round, then, when a value for round r has been chosen, any round r'', r' < r'' < r, cannot be
successful; indeed at least a majority of the processes are committed for round r, which implies
that at least a majority of the processes are rejecting round r''. This, along with the fact that
info-quorums and accepting-quorums are majorities, implies that if a round r is successful, then
any round with a bigger round number $\tilde{r} > r$ is for the same value. Indeed the information sent
by processes in the info-quorum of round $\tilde{r}$ is used to choose the value for the round, but since
info-quorums and accepting-quorums share at least one process, at least one of the processes in the
info-quorum of round $\tilde{r}$ is also in the accepting-quorum of round r. Indeed, since the round is
successful, the accepting-quorum is a majority. This implies that the value of any round $\tilde{r} > r$ must
be equal to the value of round r, which, in turn, implies agreement.
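The rule used by the leader to pick the value of its round can be sketched as a small function over the
“Last” replies of an info-quorum. The encoding is an assumption made for the example: each reply is
a pair (round, value), with value None when the agent has not yet accepted any value.

```python
def choose_value(last_replies, initial_value):
    """Value for a new round, given the "Last" replies of an info-quorum."""
    accepted = [(r, v) for r, v in last_replies if v is not None]
    if not accepted:
        # No agent of the info-quorum accepted a value in a previous round.
        return initial_value
    # Otherwise take the value of the latest round reported by the agents.
    _latest_round, value = max(accepted, key=lambda reply: reply[0])
    return value
```

For example, with replies for rounds (2, 1) and (1, 3) carrying different values, the value of round
(2, 1) is proposed, and the leader's own initial value is ignored.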

Instead of majorities for info-quorums and accepting-quorums, any quorum system can be used 
(DPAXOS uses the quorum system of the configuration). Indeed the only property that is required is 
that there be a process in the intersection of any info-quorum with any accepting-quorum. 
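The intersection requirement is easy to test on small examples. The sketch below builds the majority
quorums of a set of processes and checks that every info-quorum meets every accepting-quorum; the
function names are hypothetical and chosen for this illustration.

```python
from itertools import combinations, product


def majorities(processes):
    """All subsets of `processes` that contain more than half of them."""
    procs = sorted(processes)
    smallest = len(procs) // 2 + 1
    return [frozenset(c)
            for size in range(smallest, len(procs) + 1)
            for c in combinations(procs, size)]


def always_intersect(info_quorums, accepting_quorums):
    """True when each info-quorum shares a process with each accepting-quorum."""
    return all(i & a for i, a in product(info_quorums, accepting_quorums))
```

Majorities satisfy the check by a counting argument, but any pair of quorum families that passes it
could be used in their place.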


To end up with a decision value, rounds must be started until at least one is successful. 


7.2.3 The DPAXOS algorithm


The DPAXOS algorithm borrows the basic ideas of the PAXOS algorithm, but it is built upon the
DLC group communication service. Thus it exploits the properties guaranteed by such a service. A
round of the DPAXOS algorithm is similar to that of the PAXOS algorithm with the following major
differences: (i) since the DLC service provides a new configuration any time the leader changes,
at most one round is conducted in each configuration (hence we do not distinguish
rounds and configurations); (ii) the first part of a round, whose purpose is to find a value that
the leader proposes in the round, is no longer necessary, because the group communication service
provides, with the starting state of a new configuration, a value that can be safely proposed by the
leader. Thus in DPAXOS the leader needs only to send “Begin” messages and collect the “Accept”
messages.

Because of (i) we no longer need to worry about processes committing to reject rounds or to
make sure that messages belonging to different rounds do not interfere: by the properties of the DLC
group communication service, processes receive and send messages only in the current
configuration. Since in each configuration only one round is conducted, no interference is possible
and older rounds are automatically rejected. Configuration identifiers can be used as round numbers
(and vice versa). Because of (ii) a round of DPAXOS is shorter than a round of PAXOS (the first part
of the round is basically done while establishing the new configuration).

As a result of the above, the code of DPAXOS is simpler and much shorter than the code of PAXOS 
as implemented in [29, 30]. 

Since in each configuration only one round is run, round numbers in DPAXOS are configuration
identifiers. Hence the set of round numbers is R = G. We say that round r precedes round r' if
r < r'. If round r precedes round r' then we also say that r is a previous round, with respect to
round r'.

For the DPAXOS algorithm, the set S consists of pairs (r, v), where $r \in G$ and $v \in \mathcal{X}$.

The code, DLC-TO-PAXOS$_p$, is shown in Figure 7-2. The overall system DPAXOS consists of the
composition of DLC and automaton DLC-TO-PAXOS$_p$ for each $p \in P$.

Next we provide an informal description of the code. 

We start by describing the state variables. Variable current$_p$ contains the current configuration
for process p; if process p runs a round, then the round number is given by current.id$_p$. Variable
rnd-val$_p$ contains the value that process p proposes in the current round. Variable decision$_p$ contains
the decision of process p. Variable used-ids$_p$ $\subseteq OID_p$ is a set of identifiers used to distinguish
operations that process p submits to the DLC service. Variable last-rnd$_p$ contains the last round for
which process p has accepted the value. Variable last-val$_p$ contains the value of round last-rnd$_p$.

DLC-TO-PAXOS$_p$ has a reconfiguration phase, which starts when process p is notified of a new
configuration and ends when process p is given the starting state for the new configuration. When
not in the reconfiguration phase process p performs normal processing. Variable status$_p$ is used to switch
from reconfiguration phase to normal processing. For normal processing we have status$_p$ = normal.


Variable mode$_p$ is used by the leader to go through the steps of a round.

DLC-TO-PAXOS$_p$

Signature:

Input:    NEWCONF(c)$_p$, $c \in C$, $p \in c.set$
          NEWSTATE(s)$_p$, $s \in S$, $p \in P$
          DELIVER(m, i)$_p$, $m \in M$, $i \in OID$, $p \in P$
          RESPOND(a, i)$_p$, $a \in A$, $i \in OID_p$, $p \in P$
          P2P-RECV(m)$_{q,p}$, $m \in M$, $q, p \in P$
Internal: NEWROUND$_p$, $p \in P$
Output:   SUBMIT(m, $\phi$, i)$_p$, $m \in M$, $\phi \in \Phi$, $p \in P$, $i \in OID_p$
          ACKDLVR(a, i)$_p$, $a \in A$, $i \in OID$, $p \in P$
          SUBMIT-STATE(s, $\phi$)$_p$, $s \in S$, $\phi \in \Phi$, $p \in P$
          P2P-SEND(m)$_{p,q}$, $m \in M$, $q, p \in P$

State:

current $\in C$, init $c_0$ if $p \in P_0$, else $\perp$
rnd-val $\in \mathcal{X}$, initially $v_p$
decision $\in \mathcal{X}$, initially $\perp$
used-ids $\subseteq OID$, initially $\emptyset$
ack-q, seqof($A \times OID$), init $\lambda$
ack-d, either “ack” or $\perp$, init $\perp$
last-rnd $\in G$, initially $g_0$ if $p \in P_0$, else $\perp$
last-val $\in \mathcal{X}$, initially $v_p$
status $\in$ {normal, exch-ready, exch-wait}, init normal
mode $\in$ {wait, newround, collect, begincast, decided},
    init newround for $p = c_0.ldr$, wait else
ack-success(q), a boolean, initially false for all $q \in P$

Actions:

input NEWCONF(c)$_p$
  Eff: current := c
       status := exch-ready
       ack-q := $\lambda$
       if p = c.ldr and mode $\neq$ decided then
          mode := newround

output SUBMIT-STATE((r, v), $\phi_{state}$)$_p$
  Pre: status = exch-ready
       r = last-rnd
       v = last-val
  Eff: status := exch-wait

input NEWSTATE((r, v))$_p$
  Eff: status := normal
       last-rnd := current.id
       last-val := v
       rnd-value := v
       Hfrom(current.id) := r
       Hvalue(current.id) := v

internal NEWROUND$_p$
  Pre: mode = newround
       status = normal
  Eff: mode := begincast

output SUBMIT((“begin”, v), $\phi_{begin}$, i)$_p$
  Pre: current $\neq \perp$
       i $\notin$ used-ids
       status = normal
       v = rnd-value
       mode = begincast
  Eff: used-ids := used-ids $\cup$ {i}
       mode := wait

input DELIVER((“begin”, v), i)$_p$
  Eff: append ((“accept”), i) to ack-q
       last-rnd := current.id
       last-val := v

output ACKDLVR(a, i)$_p$
  Pre: head(ack-q) = (a, i)
  Eff: ack-q := tail(ack-q)

input RESPOND((“begin”, Q), i)$_p$
  Eff: decision := rnd-value
       mode := decided
       Haccquo(current.id) := Q

output P2P-SEND(v)$_{p,q}$
  Pre: mode = decided
       v = decision
       ack-success(q) = false
  Eff: none

input P2P-RECV(v)$_{q,p}$
  Eff: decision := v
       mode := decided
       ack-d := “ack”

output P2P-SEND(“ack”)$_{p,q}$
  Pre: ack-d = “ack”
  Eff: ack-d := $\perp$

input P2P-RECV(“ack”)$_{q,p}$
  Eff: ack-success(q) := true

Figure 7-2: The DLC-TO-PAXOS code.


Variable ack-q$_p$ is a queue of acknowledgment values to be sent to the DLC service. Variable
ack-d$_p$ is used to send back to the leader an acknowledgment for the decision. Variable ack-success$_p$
is used by the leader to record those processes that have sent an acknowledgment for the decision.


Next we describe the transitions. We start with the transitions for the reconfiguration phase. 

Action NEWCONF(c)$_p$ notifies process p of a new configuration c. Process p enters the reconfiguration
phase by changing its status to exch-ready. It also resets queue ack-q in order to stop sending
acknowledgments for older rounds. Moreover if process p is the leader of the new configuration and
it has not yet reached a decision, then it prepares itself, by setting mode$_p$ to newround, to start a
new round when normal processing is re-entered.

With action SUBMIT-STATE((r, v), $\phi_{state}$)$_p$ process p submits its current state to the DLC service.
The relevant information submitted to the service consists of the last round r for which process
p has accepted a value (i.e., has sent an “accept” message) and the value v of that round. The
condenser function $\phi_{state}$ collects all the states submitted by processes in the current configuration
and computes the value to propose in the current round. Formally it is a function that takes a set
of pairs $W = \{(r, v) \mid r \in G, v \in \mathcal{X}\}$ and returns a pair $(r', v') \in W$ where r' is such that $r' \geq r$ for
all $(r, v) \in W$. At this point process p has to wait for the DLC service to deliver the starting state.
Hence it sets status$_p$ to exch-wait.
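A condenser function of this kind amounts to selecting a maximum. The sketch below is an illustration
under stated assumptions: configuration identifiers are integers, submitted states are pairs (r, v),
and a round equal to None models a process whose last-rnd is still undefined; how the DLC service
actually represents these states is not specified here.

```python
def phi_state(submitted):
    """Return a submitted pair (r, v) whose round r is the largest one."""
    known = [(r, v) for r, v in submitted if r is not None]
    if not known:
        raise ValueError("no submitted state carries a round number")
    # The pair with the biggest round identifier supplies the starting state.
    return max(known, key=lambda pair: pair[0])
```

For instance, phi_state({(1, "a"), (3, "b"), (2, "c")}) selects the pair for round 3, so the value "b"
becomes the value to propose in the new configuration.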

Action NEWSTATE((r, v))$_p$ delivers to process p the starting state for p’s current configuration. This
starting state contains the value v to propose in the round for the current configuration. Normal
processing is re-entered by setting status$_p$ = normal.
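The reconfiguration phase can be pictured as a cycle through the three values of the status variable:
normal, exch-ready, exch-wait, and back to normal. The class below is a hypothetical, much reduced
model of that cycle for a single process; it keeps only the variables touched by the three actions and
ignores the leader's mode variable and the history variables.

```python
class ReconfigurationPhase:
    """Tracks status, last-rnd, last-val and rnd-value for one process."""

    def __init__(self, last_rnd, last_val):
        self.status = "normal"
        self.current_id = last_rnd
        self.last_rnd = last_rnd
        self.last_val = last_val
        self.rnd_value = last_val

    def newconf(self, conf_id):
        # NEWCONF: enter the reconfiguration phase for the new configuration.
        self.current_id = conf_id
        self.status = "exch-ready"

    def submit_state(self):
        # SUBMIT-STATE: enabled only when status = exch-ready.
        if self.status != "exch-ready":
            raise RuntimeError("SUBMIT-STATE is not enabled")
        self.status = "exch-wait"
        return (self.last_rnd, self.last_val)

    def newstate(self, r, v):
        # NEWSTATE: adopt the starting state and resume normal processing.
        self.status = "normal"
        self.last_rnd = self.current_id
        self.last_val = v
        self.rnd_value = v
```

Note that, as in the code of the automaton, the round r carried by the starting state is not stored in
last-rnd; the process records the identifier of its current configuration instead.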


Next we describe the transitions for normal processing, where rounds are run. 

Action NEWROUND$_p$ starts a new round; in order to do this, the leader of the current configuration,
say process p, must be ready to start a new round, that is mode$_p$ must be equal to newround. The
effect is just to change mode$_p$ to begincast. Process p is now ready to send a begin message for
the current round.

Action SUBMIT((“begin”, v), $\phi_{begin}$, i)$_p$ takes care of sending the begin message through the DLC service.
The condenser function $\phi_{begin}$ is a function that takes a set of acknowledgment values
$\{(\text{“accept”}, i)_q \mid q \in Q\}$ for some set of processes $Q \subseteq P$ and returns (“begin”, Q). After the
execution of this action process p has mode$_p$ = wait because it needs to wait for acknowledgment values.
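The second condenser simply records who acknowledged. A minimal sketch, assuming that the
acknowledgment values are handed over as a dictionary from each acknowledging process q to its value
(“accept”, i); this interface is an assumption made for the example.

```python
def phi_begin(acks):
    """Condense acknowledgment values into the response ("begin", Q)."""
    quorum = frozenset(q for q, (tag, _op_id) in acks.items() if tag == "accept")
    return ("begin", quorum)
```

The set Q returned in the response is what the leader later records as the accepting-quorum of the
round.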

Action DELIVER((“begin”, v), i)$_p$ is an input action from the DLC service, which delivers the begin
message from another process. The effect of this action is to put the acknowledgment value (“accept”, i)
into the queue ack-q$_p$, from which it will be sent back to the DLC service by action ACKDLVR(a, i)$_p$.

Action RESPOND((“begin”, Q), i)$_p$ is an input action from the DLC service which provides the response
to the “begin” message previously sent with action SUBMIT((“begin”, v), $\phi_{begin}$, i)$_p$. This action tells the
leader that processes in the quorum Q have accepted the current round. At this point the leader
can make a decision and thus sets the decision$_p$ variable to the value rnd-value$_p$ proposed in the
current round.

The remaining actions are used to spread a decision to all members of the current configuration
once the leader has reached a decision. Action P2P-SEND(v)$_{p,q}$ is used by the leader p to send the
decision v = decision$_p$ to a process q that does not yet know the decision. Action P2P-RECV(v)$_{p,q}$ is
executed by process q when it receives a decision v from the leader p; process q sets its decision$_q$
variable and sets its mode$_q$ to decided. Then, process q sends an acknowledgment back to the
leader with action P2P-SEND(“ack”)$_{q,p}$ and this acknowledgment is received by the leader with action
P2P-RECV(“ack”)$_{q,p}$.


We augment the code with the following history variables: 


• Hvalue(r)$_p$ $\in \mathcal{X} \cup \{\perp\}$, initially $v_p$ for $r = g_0$ and $p = c_0.ldr$ and $\perp$ elsewhere. This variable
records the value for round/configuration r.

• Hfrom(r)$_p$ $\in R \cup \{\perp\}$, initially $\perp$ for all r, p. This variable records the round/configuration from
which Hvalue(r) is taken.

• Haccquo(r)$_p$, a subset of P or $\perp$, initially $\perp$ for all r, p. This variable records the
accepting-quorum of round r.


We conclude with a few remarks. Submitting the state for a new configuration corresponds 
to sending a “Last” message in the original PAXOS algorithm. Notice that there is no need to 
commit to reject older rounds, because this is automatically guaranteed by the configuration oriented 
communication of the DLC service. Submitting the begin message to the DLC service corresponds to 
broadcasting a “Begin” message in the original PAXOS algorithm. Sending the acknowledgment value 
(“accept” ,i) for a “begin” operation, corresponds to sending an “Accept” message in the original 
PAXOS algorithm. 

Finally we remark that the DLC service encapsulates in the state-exchange mechanism part of
the PAXOS algorithm. This is done by means of the condenser function $\phi_{state}$ which computes the
value of the latest configuration for which processes that are members of the new configuration have
accepted a value. This computation is a key point in the original PAXOS algorithm.


7.2.4 Proof of correctness for DPAXOS 


In this section, we prove the correctness of DPAXOS. We recall that since at most one round is run in 
a configuration, we use configuration identifiers as round numbers (round numbers are elements of 
the set G). Also, we say that a configuration c is successful in a state s when s.Haccquo(c.id) $\neq \perp$;
informally this means that the round conducted in the configuration is successful, and a decision is 
made by the leader. 


Next we provide invariants needed to prove agreement. We start with some basic invariants.

Invariant 7.2.1 (DPAXOS)
In any reachable state the following is true. Let c be a configuration established at a process p and
such that $c.id > g_0$. Then Hvalue(c.id)$_p$ $\neq \perp$.

Proof: Since configuration c is established at p and $c.id > g_0$, we have that action NEWSTATE((r, v))$_p$
for some $r \in G$ and $v \in \mathcal{X}$, with $v \neq \perp$, has been executed (this is not true for $c = c_0$). By the code
of this action we have that Hvalue(c.id)$_p$ = v. □


Invariant 7.2.2 (DPAXOS)
In any reachable state the following is true. Let c be a successful configuration and let p = c.ldr.
Then Haccquo(c.id)$_p$ $\neq \perp$ and c is established at every process of Haccquo(c.id)$_p$ and at p.

Proof: In order for a configuration c to be successful, the leader p = c.ldr must propose a value.
Clearly it must be that c is established at p. In order for a configuration c to be successful, the leader
p must execute action RESPOND((“begin”, Q), i)$_p$ which sets Haccquo(c.id)$_p$ to Q. Clearly any process of
Q must have established c. □


Invariant 7.2.3 (DPAXOS)
In any reachable state the following is true. Let c be an established configuration such that $g_0 < c.id$.
Then for any two processes p, q that have established c, we have Hvalue(c.id)$_p$ = Hvalue(c.id)$_q$.

Proof: Process q sets Hvalue(c.id)$_q$ to v when NEWSTATE((r, v))$_q$ for configuration c is executed.
Process p sets Hvalue(c.id)$_p$ to v' when NEWSTATE((r', v'))$_p$ for configuration c is executed. By the
code of the DLC service, every process gets the same state for configuration c, that is r' = r and
v' = v. □


The following invariant states that when a configuration c is successful, any other configuration
up to (and including) the next totally established configuration is for the same value as c.

Invariant 7.2.4 (DPAXOS)
In any reachable state the following is true. Let c be a successful configuration. Then for any
configuration c' established at a process q, with c.id < c'.id and such that there are no totally
established configurations in between c and c', we have that Hvalue(c'.id)$_q$ = Hvalue(c.id)$_{c.ldr}$ $\neq \perp$.


Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. In the initial state the invariant is vacuously true because there
is no successful configuration.

For the inductive step assume that the invariant is true in a state s. We need to prove that the
invariant is true in s' for any step $(s, \pi, s')$. Consider s' and fix c and c' as required by the statement
in s'. Let p = c.ldr. By Invariant 7.2.2, in state s configuration c is established at p and at the
accepting-quorum Haccquo(c.id)$_p$ = Q $\in$ c.qrms. So we have that Q $\subseteq$ s.state-dlv[c.id].

Now we distinguish two cases: (i) configuration c' is established at q in state s, (ii) configuration c'
is not established at q in state s. In the former case the invariant follows by the inductive hypothesis.
We need to consider the latter case. Since c' is not established at q in s and is established at q in s'
it must be the case that $\pi$ = NEWSTATE((r, v))$_q$ for some $r \in G$ and $v \in \mathcal{X}$. Configurations c and c' are
not dead in s', as well as in s, because they are established in s'; moreover, by assumption, we have
that in s' there is no totally established configuration between c and c'. Hence, by Invariant 7.1.1,
there exists a quorum Q' $\in$ c.qrms such that Q' $\subseteq$ c'.set. By the properties of quorums, there
exists a process q' $\in$ Q $\cap$ Q'. For such a process we have that q' $\in$ c'.set and q' $\in$ s.state-dlv[c.id].

Let s'' be the state in which process q' executes action SUBMIT-STATE((r', v')) that submits the state
of q' to the condenser function $\phi_{state}$ for c'; since q' $\in$ s.state-dlv[c.id] and in state s'' we have
that current$_{q'}$ = c'.id, it must be the case that q' $\in$ s''.state-dlv[c.id] (because q' will not execute
any other action for configuration c once its current configuration is c'). Hence the pair (r', v')
that q' submits to the condenser function $\phi_{state}$ for c' is such that $r' \geq c.id$. By the definition of
$\phi_{state}$ we have that the pair (r, v) returned by action $\pi$ is such that $r \geq c.id$. Hence we have that
s'.Hfrom(c'.id)$_q$ $\geq$ c.id.

Let q'' be the process from which the $\phi_{state}$ function takes the pair (r, v) returned with action $\pi$; thus
v = s'.Hvalue(r)$_{q''}$. By the code we have that s'.Hfrom(c'.id)$_q$ = r and that s'.Hvalue(c'.id)$_q$ = v.
Hence s'.Hvalue(c'.id)$_q$ = s'.Hvalue(r)$_{q''}$.

If r = c.id then we consider two cases.

Case (i): $r = g_0$. Then we claim that q'' = p. Indeed if the configuration with the biggest identifier
among those submitted to the condenser function for c' is $c_0$, this means all members of c', when
they submit their state to the condenser function, have not established any other configurations with
identifier greater than $g_0$. This implies that all processes except p submit $(\perp, -)$ to the condenser
function $\phi_{state}$ and p submits $(g_0, v_p)$. Hence p is selected by the condenser function $\phi_{state}$. Thus
we have s'.Hvalue(c'.id)$_q$ = s'.Hvalue(c.id)$_p$, as needed.

Case (ii): $r > g_0$. Obviously we have that s'.Hvalue(c'.id)$_q$ = s'.Hvalue(c.id)$_{q''}$. By Invariant 7.2.3
we have that s'.Hvalue(c.id)$_{q''}$ = s'.Hvalue(c.id)$_p$, and the invariant holds in this case.

It remains to consider the case r > c.id. By the inductive hypothesis applied to c, r and q'' we have
that s'.Hvalue(r)$_{q''}$ = s'.Hvalue(c.id)$_p$. Hence we conclude that s'.Hvalue(c'.id)$_q$ = s'.Hvalue(c.id)$_p$,
as needed. □


The following invariant is similar to the previous one, but considers totally established
configurations instead of successful ones. It states that when a configuration c is totally established, any
other configuration up to (and including) the next totally established configuration is such that its
Hvalue is the same as that of c. First we give an auxiliary invariant.

Invariant 7.2.5 (DPAXOS)
In any reachable state the following is true. Let c be a totally established configuration such that
$g_0 < c.id$. Then for any configuration c' established at a process q, with c.id < c'.id and such that
there are no totally established configurations in between c and c', we have that
Hvalue(c'.id)$_q$ = Hvalue(c.id)$_{c.ldr}$.


Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. In the initial state the invariant is vacuously true because there
is no established configuration c such that $g_0 < c.id$.

For the inductive step assume that the invariant is true in a state s. We need to prove that the
invariant is true in s' for any step $(s, \pi, s')$. Consider state s' and let c and c' be as required by the
statement in state s'. Let p = c.ldr. We distinguish four possible cases.

CASE 1: c $\in$ s.$\mathit{TotEst}$ and c' is established at q in s. Then we can apply the inductive hypothesis.

CASE 2: c $\in$ s.$\mathit{TotEst}$ and c' is not established at q in s. Then it must be the case that
$\pi$ = NEWSTATE((r, v))$_q$. This action sets Hvalue(c'.id)$_q$ = v.

Since c' is established in s' it is also not dead. Clearly also c is not dead in s'. By Invariant 7.1.1
we have that there is a quorum Q of c such that Q $\subseteq$ c'.set. Let q' $\in$ Q. For such a process we have
that q' $\in$ s'.state-dlv[c.id], q' $\in$ c'.set.

Let s'' be the state in which process q' executes action SUBMIT-STATE((r', v')) that submits the state
of q' to the condenser function $\phi_{state}$ for c'; since q' $\in$ s.state-dlv[c.id] and in state s'' we have that
current$_{q'}$ = c'.id it must be the case that q' $\in$ s''.state-dlv[c.id] (because q' will not execute any other
action for configuration c once its current configuration is c'). Hence the pair (r', v') that q' submits
to the condenser function $\phi_{state}$ for c' is such that $r' \geq c.id$. By the definition of $\phi_{state}$ we have that
the pair (r, v) returned by action $\pi$ is such that $r \geq c.id$. Hence we have that s'.Hfrom(c'.id)$_q$ $\geq$ c.id.

Let q'' be the process from which the $\phi_{state}$ function takes the pair (r, v) returned with action $\pi$; thus
v = s'.Hvalue(r)$_{q''}$. By the code we have that s'.Hfrom(c'.id)$_q$ = r and that s'.Hvalue(c'.id)$_q$ = v.
Hence s'.Hvalue(c'.id)$_q$ = s'.Hvalue(r)$_{q''}$.

If r = c.id then we have that s'.Hvalue(c'.id)$_q$ = s'.Hvalue(c.id)$_{q''}$. By Invariant 7.2.3 we have
that s'.Hvalue(c.id)$_{q''}$ = s'.Hvalue(c.id)$_p$ and thus the invariant holds. So consider the case r > c.id.
By the inductive hypothesis applied to c and r and q'' we have that s'.Hvalue(r)$_{q''}$ = s'.Hvalue(c.id)$_p$.
Hence we conclude that s'.Hvalue(c'.id)$_q$ = s'.Hvalue(c.id)$_p$.

CASE 3: c $\notin$ s.$\mathit{TotEst}$, c' is established at q in s. Then it must be the case that $\pi$ = NEWSTATE((r, v))$_{p'}$
for some process p' that totally establishes configuration c. Configurations c and c' are not dead in
s'. By Invariant 7.1.1 we have that there is a quorum Q of c such that Q $\subseteq$ c'.set. Let q' $\in$ Q. For
such a process we have that q' $\in$ s'.state-dlv[c.id], q' $\in$ c'.set.

The proof proceeds as in the previous case: Let s'' be the state in which process q' executes action
SUBMIT-STATE((r', v')) which submits the state of q' to the condenser function $\phi_{state}$ for c'; etc. (as done
in the previous case).

CASE 4: c $\notin$ s.$\mathit{TotEst}$, c' not established at q in s. This is not possible because a single action
cannot make both c totally established and c' established. □


The following invariant follows easily from the previous one.

Invariant 7.2.6 (DPAXOS)
In any reachable state the following is true. Let c be a totally established configuration such that
$g_0 < c.id$. Then for any configuration c' established at c'.ldr, with c.id < c'.id, we have that
Hvalue(c'.id)$_{c'.ldr}$ = Hvalue(c.id)$_{c.ldr}$.

Proof: Let c and c' be as required by the statement. Let $c_1, c_2, \ldots, c_k$ be the sequence, in order
of configuration identifiers, of the totally established configurations properly between c and c'. By
Invariant 7.2.5 we have that Hvalue($c_1$.id)$_{c_1.ldr}$ = Hvalue(c.id)$_{c.ldr}$; by the same invariant we have
Hvalue($c_2$.id)$_{c_2.ldr}$ = Hvalue($c_1$.id)$_{c_1.ldr}$ and so on up to Hvalue(c'.id)$_{c'.ldr}$ = Hvalue($c_k$.id)$_{c_k.ldr}$.
Thus we have that Hvalue(c'.id)$_{c'.ldr}$ = Hvalue(c.id)$_{c.ldr}$. □


The following invariant is crucial to proving agreement. 


Invariant 7.2.7 (DPAXOS) 
In any reachable state the following is true. Let c be a successful configuration. Then for any
configuration c' established at c'.ldr, with c.id < c'.id, we have that
Hvalue(c'.id)_{c'.ldr} = Hvalue(c.id)_{c.ldr}.

Proof: If there are no totally established configurations between c and c' then the invariant follows
directly from Invariant 7.2.4. So assume that there exists at least one totally established configuration
with identifier strictly greater than c.id and strictly smaller than c'.id. Let c* be the totally established
configuration having the smallest identifier strictly greater than c.id and let q be c*.ldr. Clearly we
have that c* is established at q. By definition of c* there are no totally established configurations
between c and c*. By Invariant 7.2.4 we have that Hvalue(c*.id)_q = Hvalue(c.id)_{c.ldr}.

Since c.id > g_0, we have c*.id > g_0. Hence by Invariant 7.2.6 we have that Hvalue(c'.id)_{c'.ldr} =
Hvalue(c*.id)_q. Hence Hvalue(c'.id)_{c'.ldr} = Hvalue(c.id)_{c.ldr}, as needed. □


We are now ready to prove agreement. 
Theorem 7.2.8 In any execution of the system DPAXOS, agreement is satisfied. 


Proof: In order to prove agreement we need to show that all the decision variables are set to
the same value. By the code it is immediate that decision variables are always set to be equal
to Hvalue(c.id)_{c.ldr} for some successful configuration c. Hence it is enough to prove that, in any
reachable state s, any two successful configurations c and c' are such that
s.Hvalue(c.id)_{c.ldr} = s.Hvalue(c'.id)_{c'.ldr}.

Let p = c.ldr and p' = c'.ldr and without loss of generality assume that c.id < c'.id. By
Invariant 7.2.7 we have that s.Hvalue(c.id)_p = s.Hvalue(c'.id)_{p'}. □

Validity is easier to prove since the value proposed in any round comes either from an initial
value or from a previous round.


Invariant 7.2.9 (DPAXOS) 
In any reachable state of an execution α, for any round r such that Hvalue(r)_{r.ldr} ≠ ⊥, we have that

Hvalue(r)_{r.ldr} ∈ V_α.

Proof: By induction on the length of the execution α. The base case consists of proving that the
invariant is true in the initial state. In the initial state Hvalue(r)_p is not ⊥ only for r = g_0 and
p = r.ldr. Moreover Hvalue(g_0)_p is equal to the initial value of p. Hence the assertion is true.

For the inductive step assume that the invariant is true in a reachable state s. We need to prove
that the invariant is still true in s' for any possible step (s, π, s').

Clearly the only actions that can make the assertion false are those that set Hvalue(r)_p for some
round r and p = r.ldr. The only action that sets Hvalue(r)_p is action π = NEWSTATE((r',v))_p for round
r. Action π sets Hvalue(r)_p to v. We need to prove that v ∈ V_α. This follows from the definition of
φ_state and the fact that the values submitted to φ_state are the last-val_q variables which, in turn, are
either the initial value v_q of q or the value Hvalue(r')_{r'.ldr} of a previous round r'; by the inductive
hypothesis we have that Hvalue(r')_{r'.ldr} belongs to V_α. □


Theorem 7.2.10 In any execution of the system DPAXOS, validity is satisfied. 


Proof: Let α be an execution of DPAXOS. A variable decision is always set to be equal to some
Hvalue(r)_p ≠ ⊥ for some r and p = r.ldr or to some other decision variable. By Invariant 7.2.9 we
have that Hvalue(r)_p belongs to V_α. Hence validity is satisfied. □


Finally we claim, informally, that termination is satisfied. We remark that we are making the 
assumption that any failure in the system is detected by the group communication service which 
changes the configuration in order to reflect the new status of the underlying distributed system. 

Consider an execution of the system such that there exists a state s in which a configuration c
becomes totally established. Let t be the point in time at which the system enters state s. Assume that
there are no failures after time t. There is nothing that can prevent the round run in configuration
c from succeeding. Thus the leader of configuration c eventually writes its own decision variable. Once
having done that, the leader keeps sending (see code) the value of its decision variable to any other
process member of the configuration until it receives an acknowledgment. Since there are no failures
every member of the configuration c will eventually receive the message from the leader and write
the decision variable.

7.2.5 Remarks


We remark that the point-to-point communication mechanism of DLC is used by DLC-TO-PAXOS just 
to spread a reached decision to all members of the current configuration. Though it is fine to use 
configuration synchronous point-to-point messages, there is no need to require that messages used to 
spread the decision be configuration synchronous. A regular point-to-point channel which delivers 
messages regardless of the configurations in which the sender or the receiver are works fine too. 
Hence for the DPAXOS algorithm we could use a weaker version of the DLC service which provides 
point-to-point messages without configuration synchrony. We have used this stronger version because 
the algorithm that we present in the next section needs configuration synchronous point-to-point 
messages. 

The original PAXOS algorithm [61] is designed to work with majorities or with more general 
quorums of a static universe of processes. Using quorums is good for handling transient failures of a 
system. However it does not work well for permanent failures. The usefulness of building PAXOS over 
the DLC group communication service is that it can adapt also to permanent failures by changing 
the configuration of the system. 

It would be useful to compare the performance of the PAXOS algorithm built on top of DLC with 
that of the original PAXOS algorithm. Since our work has not addressed performance issues we leave 
this as future work. 

The same technique that we have used to build PAXOS on top of DLC could be used to build
MULTIPAXOS on top of DLC. The MULTIPAXOS algorithm [61]¹ is basically a sequence of instances
of the PAXOS algorithm that run together and optimize the number of messages needed in the first
part of the round. The optimization is achieved by sending a unique message that works for all the
instances of PAXOS. By using the DLC service such an optimization would be obtained by running
multiple instances of PAXOS in the same configuration; the state exchange needs to be done only once
for all the instances.


7.3 A Replicated Atomic Object Algorithm 


In this section we develop a data replication algorithm that implements a replicated atomic object 
with arbitrary operations (not necessarily just read and write, though in practice these are the most 
common type of operations used). The algorithm, called RAB (Replicated Atomic oBject), is built 
upon the DLC service and uses a primary site to handle access to the object. 

We start with an informal description of the algorithm, then we provide the formal code and
finally we provide key arguments for its correctness. Providing a formal proof of correctness is left
as future work.

¹ The name MULTIPAXOS is actually used in [29]. The original paper by Lamport [61] uses a different name
(multi-decree parliament protocol).


7.3.1 Description of RAB 


Operations are centralized at the leader of the current configuration; the leader requires the collab- 
oration of at least a quorum of processes in order to handle requests. 

Clients of the service request to perform operations on the data. Each process accepts requests
from its client and places them in a local order. Then each of the received requests is sent to
the leader, who is responsible for building a global order for all the clients' requests. For each of
these requests the leader makes sure that at least a quorum of processes know the request before
providing an answer to the process that originated the request. Once such an answer is provided to
the originator process, a response can be given back to the client.

When a configuration change happens all the members submit their knowledge about the requests
performed so far and a new common state is computed from the local knowledge of the processes.
In particular, each process submits its own information about the global order of operations plus
all the local requests that are still pending, that is, have been submitted to the leader but have not
received a response. The global orders submitted by each member of the new configuration are used
to compute the most up to date global order, while the information about pending requests is used
to locate those operations that must be resubmitted to the leader.


7.3.2 The code of DLC-TO-RAB


In this section we provide the code of algorithm DLC-TO-RAB. We first define some data types. We
denote by V the set of values that the shared data can assume and by v_0 ∈ V a predefined value.
The set T is a set of types of operations (e.g., read, write operations).

The set D of “operation descriptors” is defined as D = {(p,t,w,i) | p ∈ P, t ∈ T, w ∈ V, i ∈ N^{>0}}.
Operation descriptors are used to describe both the requests from the clients and the corresponding
responses. For an element y = (p,t,w,i) of D we use the following selectors to extract the single
components: y.origin = p, y.type = t, y.param = w and y.local-rank = i. Component y.origin
records the client at which the request has originated. Component y.type specifies the type of
operation. For example, if we want a read-write register, types could be T = {“read”, “write”}.
Component y.param provides possible parameters that need to be passed along with the request
or with the corresponding response. Considering again the case of a read-write register, a write
needs to pass the value to be written and the response to a read needs to pass the value read. For
simplicity we assume that only one value needs to be passed and this value is an element of V (this
is so in the case of a read-write register).

The set of messages that can be sent over the point-to-point channels and through the DLC service
is defined as M = D ∪ ({“req”, “ans”} × D). The set of operation identifiers is OID = N^{>0} × P.
The set S is defined as S = {(o,d,a,h) | o ∈ arrayof(D), d ∈ arrayof(bool), a ∈ arrayof(bool), h ∈ G}.
We remind the reader that the notation arrayof(D) indicates an array whose elements are either
elements of D or ⊥. The set of acknowledgment values is A = {“ack”} ∪ S. The set of response
values is R = {“done”} ∪ {(o,d,a) | o ∈ arrayof(D), d ∈ arrayof(bool), a ∈ arrayof(bool)}.


The DLC-TO-RAB algorithm uses two condenser functions that we define in the following:

• φ_done: This condenser function takes a set of acknowledgment values “ack” and returns the
string “done”. This function is used by the leader to make sure that a quorum of processes
have received information about a particular operation.

• φ_rabstate: Let c be the configuration for which this condenser function is to be used. The
condenser function takes a collection of tuples of S, one for each member of c, and returns
a triple (o,d,a) ∈ R, defined as follows:

— o is defined as follows: Let R be the set of submitted tuples that have the maximum high
component. For any i ∈ N^{>0} such that there exists at least one element x ∈ R with
x.order(i) ≠ ⊥, fix any such element x and set o(i) := x.order(i). For any i for which no
such element exists set o(i) := ⊥.

— d is defined as the “or” of the req-done components in R.

— a is defined as the “or” of the req-answrd components in R.
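The definition of the state condenser function can be illustrated with a short sketch. The encoding below is an assumption made only for illustration: arrays are Python dictionaries indexed from 1 (a missing key stands for the undefined value), configuration identifiers are integers, and the name `rabstate` is hypothetical. It is not the thesis's code.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: arrays are dicts keyed by 1-based index;
# a missing key plays the role of the undefined value.
Order = Dict[int, Tuple]                      # index -> operation descriptor
Flags = Dict[int, bool]                       # index -> boolean flag
Submitted = Tuple[Order, Flags, Flags, int]   # (order, req_done, req_answrd, high)


def rabstate(submitted: List[Submitted]) -> Tuple[Order, Flags, Flags]:
    """Condense the submitted tuples into a starting state (o, d, a)."""
    top = max(high for (_, _, _, high) in submitted)
    # Only tuples with the maximum high component are considered.
    recent = [s for s in submitted if s[3] == top]

    o: Order = {}
    d: Flags = {}
    a: Flags = {}
    for order, done, answered, _ in recent:
        for i, desc in order.items():
            o.setdefault(i, desc)             # keep any one defined order(i)
        for i, flag in done.items():
            d[i] = d.get(i, False) or flag    # "or" of the req-done components
        for i, flag in answered.items():
            a[i] = a.get(i, False) or flag    # "or" of the req-answrd components
    return o, d, a
```

Tuples submitted by members whose high component is older than the maximum are ignored entirely, which is what lets the most recently established configuration determine the new global order.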


The code of DLC-TO-RAB is provided in Figure 7-3; we describe it next. We start with the 
description of the state variables. 

Variable current_p contains the current configuration of process p and variable high_p contains the
latest established configuration of process p. Variable local-req_p is the sequence of requests that the
client submits at process p; variable local-ans_p contains the answers for all of the requests. Variable
next_p is a pointer used to insert new requests from the client into local-req_p. Variable order_p contains
the sequence of all requests as known by process p. Variable status_p contains the status of process
p; it is used when a new configuration is announced; for regular computation this variable is set to
normal.

The remaining state variables are flags used to record that some particular actions have happened.
Variable req-sent(j)_p is set to true when the j-th request of the client at p, that is local-req(j)_p,
is sent to the leader of the current configuration. Such a request, when received by the leader,
is placed in the global sequence of requests order into some available position, say i. Variable
req-sbmttd(i)_p is set to true when the leader p has submitted the request via the DLC service
with action SUBMIT(m, φ_done, (i,p)). Variable req-acked(i)_p is set to true when process p has sent an
acknowledgment for the i-th request. Variable req-done(i)_p is set to true when the leader p receives
the response from the DLC service for the i-th request with action RESPOND(“done”, (i,p)). Variable
req-answrd(i)_p is set to true when the leader p has sent an answer for the i-th request to the
originator process of that request. Variable req-cnfrmd(j)_p is set to true when process p has given
the client a response for the j-th request submitted at p.

DLC-TO-RAB_p

Signature:

    Input:
        REQUEST(type, param)_p, type ∈ T, param ∈ V, p ∈ P
        P2P-RECV(m)_p, m ∈ M, p ∈ P
        DELIVER(m, i)_p, m ∈ M, i ∈ OID, p ∈ P
        RESPOND(a, i)_p, a ∈ A, i ∈ OID_p, p ∈ P
        NEWCONF(c)_p, c ∈ C, p ∈ c.set
        NEWSTATE(s)_p, s ∈ S, p ∈ P

    Output:
        CONFIRM(param)_p, param ∈ V, p ∈ P
        P2P-SEND(m)_p, m ∈ M, p ∈ P
        SUBMIT(m, φ, i)_p, m ∈ M, φ ∈ Φ, p ∈ P, i ∈ OID_p
        ACKDLVR(a, i)_p, a ∈ A, i ∈ OID, p ∈ P
        SUBMIT-STATE(s, φ)_p, s ∈ S, φ ∈ Φ, p ∈ P

State:

    current ∈ C_⊥, initially c_0 if p ∈ P_0, else ⊥
    high ∈ C_⊥, initially ⊥
    local-req ∈ arrayof(D), initially ⊥ everywhere
    local-ans ∈ arrayof(D), initially ⊥ everywhere
    next ∈ N^{>0}, initially 1
    order ∈ arrayof(D), initially ⊥ everywhere
    status ∈ {normal, exch-ready, exch-wait}, initially normal
    req-sent ∈ seqof(bool), initially false everywhere
    req-sbmttd ∈ seqof(bool), initially false everywhere
    req-acked ∈ seqof(bool), initially false everywhere
    req-done ∈ seqof(bool), initially false everywhere
    req-answrd ∈ seqof(bool), initially false everywhere
    req-cnfrmd ∈ seqof(bool), initially false everywhere

Derived variable:

    apply-all(i)_p is defined as follows:
    if for all k ≤ i, order(k)_p ≠ ⊥ then
        apply-all(i)_p = (q,t,w,j), where q = order(i)_p.origin, t = order(i)_p.type,
        w is the value obtained by applying operations order(1,..,i)_p to the initial value v_0 in order,
        and j = order(i)_p.local-rank;
    else
        apply-all(i)_p = ⊥.

Actions:

    input REQUEST(type, param)_p
        Eff: local-req(next) := (p, type, param, next)
             next := next + 1

    output CONFIRM(param)_p
        Pre: local-ans(i) = (p, type, param, i)
             ∀j < i, req-cnfrmd(j) = true
             req-cnfrmd(i) = false
        Eff: req-cnfrmd(i) := true

    output P2P-SEND((“req”, m))_{p,q}    choose j
        Pre: current ≠ ⊥
             q = current.ldr
             m = local-req(j)
             m.local-rank = j
             req-sent(j) = false
             status = normal
        Eff: req-sent(j) := true

    input P2P-RECV((“req”, m))_{q,p}
        Eff: Let i be such that
                 ∀k < i, order(k) ≠ ⊥
                 order(i) = ⊥
             order(i) := m

    output SUBMIT(m, φ_done, (i,p))_p
        Pre: current ≠ ⊥
             p = current.ldr
             order(i) = m
             m ≠ ⊥
             req-sbmttd(i) = false
             status = normal
        Eff: req-sbmttd(i) := true

    input DELIVER(m, (i,r))_p
        Eff: order(i) := m
             req-acked(i) := false

    output ACKDLVR(“ack”, (i,r))_p
        Pre: r = current.ldr
             ∀j < i, order(j) ≠ ⊥
             req-acked(i) = false
             status = normal
        Eff: req-acked(i) := true

    input RESPOND(“done”, (i,p))_p
        Eff: req-done(i) := true

    output P2P-SEND((“ans”, a))_{p,q}    choose i
        Pre: p = current.ldr
             order(i) = (m,j) for some m
             req-answrd(i) = false
             req-done(i) = true
             ∀k < i, order(k) ≠ ⊥
             ∀k < i, req-answrd(k) = true
             q = order(i).origin
             a = apply-all(i)
             status = normal
        Eff: req-answrd(i) := true

    input P2P-RECV((“ans”, a))_{q,p}
        Eff: j := a.local-rank
             local-ans(j) := a

Figure 7-3: The DLC-TO-RAB code.

DLC-TO-RAB_p (cont'd)

    input NEWCONF(c)_p
        Eff: current := c
             status := exch-ready
             ∀i, order(i) ≠ ⊥ and req-acked(i) = false
                 req-acked(i) := true

    output SUBMIT-STATE((o, d, a, g), φ_rabstate)_p
        Pre: status = exch-ready
             o = order
             d = req-done
             a = req-answrd
             g = prev.id
        Eff: status := exch-wait

    input NEWSTATE((o, d, a))_p
        Eff: status := normal
             high := current
             order := o
             req-done := d
             req-answrd := a
             ∀j, local-req(j) ≠ ⊥
                 if ∄i such that
                     order(i).origin = p and
                     order(i).local-rank = j then
                     req-sent(j) := false
             ∀i, order(i) ≠ ⊥
                 if req-done(i) = false then
                     req-sbmttd(i) := false
                 else
                     req-sbmttd(i) := true

Figure 7-4: The DLC-TO-RAB code (cont'd).

Next we describe the transitions. 

Action REQUEST(type, param)_p records a new request from the client at process p in the sequence of
local requests local-req_p. Pointer next_p always points to the first available location in the sequence
local-req_p.

Action CONFIRM(param)_p provides the response to the requests of the client. Such responses are
given in the same order as the requests were received. This is accomplished by using variable req-cnfrmd_p; the
response to the j-th (local) request is given back to the client after all responses for the previous
requests have been provided, and, of course, when the response is available, that is, when local-ans(j)_p
has been set.

Action P2P-SEND((“req”, m))_{p,q} is used by process p to send the request m to the leader. Such a
request is received by the leader with action P2P-RECV((“req”, m))_{q,p}. The leader inserts request m into
the global order of requests order_p in the next available position.

Once the leader p has placed a request in the i-th position of order_p, it executes action
SUBMIT(m, φ_done, (i,p))_p, where m = order(i)_p. The leader needs to make sure that at least a quorum
of processes learn about the request.

A request m submitted by process r is delivered to process p by the DLC service by means of action
DELIVER(m, (i,r))_p. Upon receiving such a request process p simply updates its own order by placing
the request m into order(i)_p. We remark that the code allows for overwriting a previous value;
however it is never the case that a process p overwrites an old value of order(i) with something
different received with action DELIVER(m, (i,r))_p. Flag req-acked(i)_p is set to false so that action
ACKDLVR will send an acknowledgment.




Action ACKDLVR(“ack”, (i,r))_p sends back to the DLC service an acknowledgment for the i-th operation
and sets req-acked(i)_p to true.

Once a quorum of processes has sent acknowledgments for a particular request, the DLC service
notifies the leader with action RESPOND(“done”, (i,p))_p. The leader simply sets the flag req-done(i) to
true to record the fact that now request i is known by a quorum of processes.

Once all the requests up to the i-th one are known to a quorum of the processes, the leader can
send an answer to the originator of the request. This is done in action P2P-SEND((“ans”, a))_{p,q}. The
code of this action uses the derived variable apply-all(i), which applies all operations up to the i-th
and returns a tuple a ∈ D that contains the response for operation i (a.param is the value of the
shared data after the i-th operation). We remark that, of course, a real implementation will only
keep the current value; it would apply operations in order and provide an answer to operation i right
after applying it and before applying operation i + 1.
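The difference between the derived variable apply-all and the incremental evaluation mentioned above can be sketched as follows. This is only an illustration under stated assumptions: descriptors are tuples (origin, type, param, local_rank), the object is a read-write register whose `apply_op` semantics is hypothetical, and the names `apply_all` and `IncrementalObject` are not taken from the thesis.

```python
from typing import Dict, Optional, Tuple

Descriptor = Tuple[str, str, Optional[int], int]  # (origin, type, param, local_rank)


def apply_op(value: int, desc: Descriptor) -> int:
    """Assumed register semantics: a write replaces the value, a read keeps it."""
    _, op_type, param, _ = desc
    return param if op_type == "write" else value


def apply_all(order: Dict[int, Descriptor], i: int, v0: int) -> Optional[Descriptor]:
    """Replay order(1..i) from v0; None if some position up to i is undefined."""
    if any(k not in order for k in range(1, i + 1)):
        return None
    value = v0
    for k in range(1, i + 1):
        value = apply_op(value, order[k])
    origin, op_type, _, rank = order[i]
    return (origin, op_type, value, rank)


class IncrementalObject:
    """Keeps only the current value and answers each operation as it is applied."""

    def __init__(self, v0: int) -> None:
        self.value = v0

    def apply_next(self, desc: Descriptor) -> Descriptor:
        self.value = apply_op(self.value, desc)
        origin, op_type, _, rank = desc
        return (origin, op_type, self.value, rank)
```

Feeding the operations to `IncrementalObject.apply_next` in global order yields the same answers as calling `apply_all` for each index, without replaying the whole prefix each time.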

When process p receives the answer for a request previously sent to the leader it just records the
answer into local-ans_p. This is done in action P2P-RECV((“ans”, a))_{q,p}.

Finally we describe the actions used for the state exchange. When a new configuration is
announced with action NEWCONF(c)_p, process p sets its current configuration to c and goes into
reconfiguration mode by setting status_p to exch-ready. It also sets req-acked_p to true for all those operations
that have pending acknowledgments to be sent; since the old configuration has been left, such an
acknowledgment must not be sent anymore and setting the flag req-acked_p to true has this effect.
Indeed in the new configuration process p has not yet received any message from the leader so it is
incorrect to acknowledge a message.

Then process p submits to the DLC service its order_p, req-done_p, req-answrd_p and high_p, which
constitute the relevant part of the state that has to be exchanged.

When all the processes have submitted their states, the DLC service is able to compute the starting
state of the new configuration by using the φ_rabstate condenser function. Then it gives this state to
process p by means of action NEWSTATE((o,d,a))_p. When this action is executed, process p updates
its order_p, req-done_p and req-answrd_p state components. It also adjusts the values of req-sent_p and
req-sbmttd_p to take care of two problems that arise in establishing a new configuration, as we
explain below.

The first problem is that any process p has to check whether all of its local requests are in
the global order o returned by NEWSTATE((o,d,a))_p; for any local request not included in the order o,
process p has to send that local request to the leader because the leader of the new configuration
does not know about such a request. This is done by setting req-sent(j) to false for those local
operations that the leader does not know about.

The other problem regards operations that are included in the order o returned by NEWSTATE((o,d,a))_p,
but for which req-done is still false. For such operations the leader cannot be sure that a quorum
of processes have them in their global order and thus cannot provide an answer for such operations.
The leader needs to resubmit such operations to the DLC service in order to make sure that a quorum
of processes learn about them, before giving an answer to the originator of the request.

A simple scenario that illustrates this problem is the following. Assume that quorums are just
majorities, the initial configuration has membership {p_1, p_2, p_3, p_4, p_5, p_6}, the leader is p_6 and the
identifier of the configuration is 1. Process p_6 receives a request op_1 from its client, puts it into
its local requests list and sends it to the leader (which in this case is itself). Then process p_6,
upon receiving its own request, puts op_1 into the first position of the global order. Process p_6
submits the request to the DLC service, but before any other process gets its message, a configuration
change happens. A new process p_7 joins the system and configuration 2, whose membership set is
{p_1, p_2, p_3, p_4, p_5, p_6, p_7}, is created. The leader of this configuration is p_7. Every member submits
its state. Only p_6 has something in its global order (namely op_1 in position 1) and thus the new
global order computed by φ_rabstate for configuration 2 just contains op_1 in position 1. Moreover we
have that req-done for this operation is false because the previous leader did not succeed in having
a quorum learn about such a request. Assume that the global order for configuration 2 is delivered
only to processes p_6 and p_7, but not to the other members. The leader p_7 cannot yet give back a
response to p_6 for the request op_1; indeed if the leader does so it allows process p_6 to give an answer
back to its client and an inconsistency may arise as we show next. Assume that configuration 3 is
established. This new configuration has membership set {p_1, p_2, p_3, p_4, p_5} and leader p_1. No one
in configuration 3 knows that op_1 has even been submitted. The global order conveyed by action
NEWSTATE((o,d,a)) is empty. A new operation op_2 comes in, say from process p_2; the leader p_1 receives
it, puts it into the first position of the global order, submits it through the DLC service, receives
acknowledgments from a quorum and gives an answer back to process p_2. We have an inconsistency:
process p_2 told its client that op_2 is the first operation applied to the shared object while process p_6
told its client that operation op_1 is the first operation applied to the shared object. Hence, before
giving a response, the leader of a new configuration needs to submit to the DLC service operations
that have not successfully gone to a quorum through the DLC service.

Thus for any i such that order(i)_p ≠ ⊥ and req-done(i) = false, that is, for any operation in
order for which the leader cannot be sure that a quorum of processes know about that operation,
the flag req-sbmttd(i) is set to false so that the operation will be submitted to the DLC service.
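The two adjustments made when a new state is installed can be summarized in a small sketch. It assumes the same illustrative encoding as before (dictionaries indexed from 1, descriptors as tuples (origin, type, param, local_rank)); the function name `on_newstate` and its return shape are hypothetical and only model the flag updates, not the full NEWSTATE action.

```python
from typing import Dict, List, Tuple

Descriptor = Tuple[str, str, object, int]  # (origin, type, param, local_rank)


def on_newstate(
    p: str,
    local_req: Dict[int, Descriptor],
    o: Dict[int, Descriptor],
    d: Dict[int, bool],
) -> Tuple[List[int], Dict[int, bool]]:
    """Return (local ranks to resend, new req-sbmttd flags) for process p."""
    # First problem: local requests missing from the new global order o
    # must be sent again, so their req-sent flag is reset.
    known = {(desc[0], desc[3]) for desc in o.values()}
    resend = sorted(j for j in local_req if (p, j) not in known)

    # Second problem: operations in o not yet known to be at a quorum
    # (req-done false) must be submitted again through the DLC service.
    req_sbmttd = {i: d.get(i, False) for i in o}
    return resend, req_sbmttd
```

In this sketch an operation is marked as submitted exactly when its req-done flag is already true, which mirrors the last loop of the NEWSTATE action in Figure 7-4.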

Another way to get around the problem mentioned above is to delay the response for an operation
i, for which the leader does not know that the operation has been spread to a quorum, until a later
operation j, j > i, is spread to a quorum in the same configuration. When operation j is known
to a quorum Q, we have that also operation i is known to a quorum because each of the processes
in Q has to establish the configuration and thus knows operation i. The RAB algorithm adopts the
solution of submitting operation i to the DLC service to make sure a quorum knows it.

7.3.3 Sketch of proof of correctness


In this section we present the key arguments for a correctness proof for the RAB algorithm. 

The overall algorithm RAB consists of the composition of DLC and automata DLC-TO-RAB_p,
for each p ∈ P. We claim that the RAB algorithm implements an atomic shared object. An
atomic shared object is an object that can be accessed concurrently by several processes that issue
invocations (requests) and receive responses for those requests in such a way that it is possible to
insert serialization points that make the responses consistent with all previous (with respect to the
serialization points) events. In the RAB algorithm invocation events are the actions REQUEST_p and
response events are the actions CONFIRM_p. We refer the reader to Chapter 13 of [65] for a formal
definition of atomic object.

We define the history variable build-order(g,i)_p ∈ D_⊥ for each process p ∈ P, each configuration
identifier g ∈ G and i ∈ N^{>0}. Such a variable is ⊥ if process p has not established g; otherwise it is
defined as follows: if current.id_p > g then build-order(g,i)_p is equal to the value order(i)_p when p
left configuration g; if current.id_p = g then build-order(g,i)_p = order(i)_p.

Next we provide key invariants that will be used to prove that RAB implements an atomic
object. Remember that variable state-dlv[c.id] contains the set of processes that have established
configuration c.


Invariant 7.3.1 In any reachable state the following is true. Let c be a configuration such that
c.ldr ∉ state-dlv[c.id]. Then for any p, q ∈ state-dlv[c.id] and any i ∈ N^{>0}, we have that
build-order(c.id,i)_p = build-order(c.id,i)_q.

Proof: When a configuration is established at a member p, process p executes action NEWSTATE((o,d,a))_p
and sets order_p := o. The tuple (o,d,a) is the same for every member of the configuration. So
initially every member has the same value of order. Within the configuration a member p updates
order(i)_p to a particular value m only when the leader r executes action DELIVER(m, (i,r))_p; but since
the leader has not established c, such an action cannot be executed. □


Invariant 7.3.2 In any reachable state the following is true. Let c be a configuration such that
c.ldr ∈ state-dlv[c.id]. Then for any p ∈ state-dlv[c.id] and any i ∈ N^{>0}, we have that if
build-order(c.id,i)_p ≠ ⊥ then build-order(c.id,i)_p = build-order(c.id,i)_{c.ldr}.

Proof: When a configuration is established at a member p, process p executes action NEWSTATE((o,d,a))_p
and sets order_p := o. The tuple (o,d,a) is the same for every member p of the configuration. So
initially every member has the same value of order_p. Within the configuration a member p updates
order(i)_p to a particular value m only when executing action DELIVER(m, (i,r))_p; but in this case we
have that the leader of the configuration is r and order(i)_r = m. □

We remark that the knowledge of order may diverge for those processes that remain in obsolete
configurations. For example if a process p updates order(i)_p to m because it receives such an m from
the leader of a configuration c_1, but a new configuration c_2 is established before any other process
updates order(i), then the only two processes to have order(i) = m might be p and c_1.ldr. Assume
that neither p nor c_1.ldr is a member of c_2. Then the leader of c_2 can write something different from
m into order(i). The above scenario is possible because m is not known to enough processes.

Given an index i, an m ∈ D and a configuration c we say that the triple (i,m,c) is good
in a state s if there exists a quorum Q ∈ c.qrms such that for every process p ∈ Q we have
s.build-order(c.id,i)_p = m.

We also say that (i,m) is good if there exists a configuration c such that (i,m,c) is good and
that an index i is good if there exists m such that (i,m) is good.
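As an illustration of the definition, the following sketch checks whether a triple is good in the special case where quorums are majorities of the configuration's membership (the same simplifying assumption used in the scenario of Section 7.3.2; general quorum systems are not modeled). The names `is_good` and `good_values`, and the dictionary encoding of build-order for one fixed configuration, are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Descriptor = Tuple[str, str, object, int]  # (origin, type, param, local_rank)
BuildOrder = Dict[str, Dict[int, Descriptor]]  # process -> (index -> descriptor)


def is_good(i: int, m: Descriptor, members: Iterable[str], build_order: BuildOrder) -> bool:
    """True if a majority of `members` has m at position i of its build-order."""
    if m is None:
        return False  # the definition only ranges over actual descriptors
    members = list(members)
    holders = [p for p in members if build_order.get(p, {}).get(i) == m]
    return len(holders) > len(members) // 2


def good_values(i: int, members: Iterable[str], build_order: BuildOrder) -> Set[Descriptor]:
    """All descriptors m such that (i, m) is good for this configuration."""
    members = list(members)
    candidates = {build_order.get(p, {}).get(i) for p in members} - {None}
    return {m for m in candidates if is_good(i, m, members, build_order)}
```

With majority quorums inside a single configuration, `good_values` can contain at most one descriptor, since two majorities of the same membership always intersect.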

The definition of a good index admits the possibility that in a given state there exist m and m'
such that (i,m) and (i,m') are both good; however, as we will see later, this never happens in the
algorithm. Indeed the notion of good index is intended to capture the fact that an operation m has
been assigned to the i-th position of order. In the following we will prove that operations assigned
to good indexes are propagated to newer configurations.


Invariant 7.3.3 In any reachable state the following is true. Suppose that w ∈ Est and c ∈ Est
such that w.id < c.id, and there are no totally established configurations x with w.id < x.id < c.id.
Suppose (i,m,w) is good. Then order(i)_p = m for every p ∈ state-dlv[c.id].


Proof: We prove the invariant by induction on the length of the execution. The base case is 
vacuously satisfied. 

For the inductive step, let s be a reachable state and assume that the invariant is true in all 
states previous to s. We need to prove that the invariant is true in s. 

Let w and c be configurations satisfying the assumption of the statement. Let (i,m,w) be good
in s. Since (i,m,w) is good in s, there exists a quorum Q ∈ w.qrms of processes such that for each
r ∈ Q we have s.build-order(w.id,i)_r = m. For the rest of the proof we fix such a Q.

We need to prove that any process p ∈ s.state-dlv[c.id] has s.order(i)_p = m. Processes that
establish c set their order variables to the value computed by the condenser function φ_rabstate for
configuration c. Hence we need to look at the inputs that the condenser function φ_rabstate for
configuration c receives from the members of c.

Since c is established, all members of c submit their state to the condenser function φ_rabstate for
configuration c. Partition c.set into three subsets S_1, S_2 and S_3, as follows: S_1 contains the processes
that had high < w.id at the moment they submitted the state to φ_rabstate for configuration c; S_2
contains the processes that had high = w.id at the moment they submitted the state to φ_rabstate for
configuration c; S_3 contains the processes that had high > w.id at the moment they submitted the
state to φ_rabstate for configuration c.

In the following we provide three claims that will be used to complete the proof.


CLAIM 1: $S_2 \cup S_3 \neq \emptyset$.

PROOF OF CLAIM 1: By Invariant 7.1.1, a quorum $Q' \in w.qrms$ is included in $c.set$.
Let $r$ be a process in $Q \cap Q'$. Such a process exists because $Q$ and $Q'$ are quorums of
$w$. Clearly $r \in c.set$. Since process $r \in Q$ we have that $s.\textit{build-order}(w.id, i)_r = m$. By
definition of $\textit{build-order}$ we have that process $r$ has established $w$ in state $s$. Process $r$
has to establish $w$ before submitting its state for $c$, because it does not take any action
for $w$ after participating in $c$. Hence at the moment $r$ submits its state for $c$ we have
that $high_r \geq w.id$. Therefore $r \in S_2 \cup S_3$. Thus $S_2 \cup S_3 \neq \emptyset$.
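The argument of Claim 1 rests on two facts: any two quorums of $w$ intersect, and some quorum of $w$ is contained in $c.set$. The following sketch illustrates both on a toy example. It assumes majority quorums, which is only one possible quorum system, and the helper names (`majorities`, `pairwise_intersecting`, `witness`) are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List, Optional


def majorities(members: Iterable[str]) -> List[FrozenSet[str]]:
    """All subsets containing a strict majority of the members."""
    ordered = sorted(members)
    need = len(ordered) // 2 + 1
    return [
        frozenset(combo)
        for size in range(need, len(ordered) + 1)
        for combo in combinations(ordered, size)
    ]


def pairwise_intersecting(quorums: List[FrozenSet[str]]) -> bool:
    """True if every two quorums share at least one process."""
    return all(q1 & q2 for q1, q2 in combinations(quorums, 2))


def witness(
    q: FrozenSet[str], w_quorums: List[FrozenSet[str]], c_set: FrozenSet[str]
) -> Optional[str]:
    """Find a process r in Q that also lies in a quorum Q' of w with Q' inside c.set."""
    for q_prime in w_quorums:
        if q_prime <= c_set:
            common = q & q_prime
            if common:
                return min(common)
    return None


w_quorums = majorities("abcde")            # quorum system of the old configuration w
Q = frozenset("abc")                       # the quorum fixed in the proof
c_set = frozenset("cdef")                  # membership of the new configuration c

print(pairwise_intersecting(w_quorums))    # True for majority quorums
print(witness(Q, w_quorums, c_set))        # c
```

In the example the only quorum of the old configuration contained in the new membership is {c, d, e}, and it meets $Q$ in process c, which plays the role of $r$.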


CLAIM 2: If $q \in S_2 \cup S_3$ then $q$ submits either $m$ or $\bot$ as $order(i)$ to the condenser function
$\phi_{rabstate}$ for configuration $c$.

PROOF OF CLAIM 2: Fix any $q \in S_2 \cup S_3$. Let $s'$ be the state in which process $q$
submits its state for the condenser function $\phi_{rabstate}$ for configuration $c$. We need to
prove that $s'.\textit{build-order}(w'.id, i)_q$ is either $m$ or $\bot$, where $w'$ is the configuration that $q$
establishes before joining $c$. (Configuration $w'$ is equal to $w$ for $q \in S_2$ and is a later one for $q \in S_3$.)
We consider two cases: $q \in S_2$ and $q \in S_3$.


CASE 1: $q \in S_2$. In this case $w' = w$ so we need to prove that $s'.\textit{build-order}(w.id, i)_q$ is
either $m$ or $\bot$.

We first notice that state $s'$ precedes state $s$.

We consider two cases: (i) $q \in Q$, and (ii) $q \notin Q$.


CASE 1.1: $q \in Q$. Since in $s'$ process $q$ already participates in $c$ and $c.id > w.id$
we have that after $s'$ process $q$ does not execute any action for configuration $w$
and thus $\textit{build-order}(w.id, i)_q$ does not change after $s'$. Since $q \in Q$, we have
that $s.\textit{build-order}(w.id, i)_q = m$. Since $\textit{build-order}(w.id, i)_q$ does not change
after $s'$, we must have $s'.\textit{build-order}(w.id, i)_q = m$.

CASE 1.2: $q \notin Q$. We first notice that since $q \in S_2$ we have that
$q \in s'.\textit{state-dlv}[w.id]$. This implies that $q \in s.\textit{state-dlv}[w.id]$.

If $s.\textit{build-order}(w.id, i)_q = \bot$, since process $q$ has already left configuration $w$
by state $s'$, we have that $s'.\textit{build-order}(w.id, i)_q = \bot$, as needed. Hence assume
that $s.\textit{build-order}(w.id, i)_q \neq \bot$.

By Invariant 7.1.1, a quorum $Q'' \in w.qrms$ is included in $c.set$. Since $Q$ and
$Q''$ are quorums of $w$, there exists a process $r \in Q \cap Q''$. Clearly $r \in c.set$.
Since $r \in Q$ we have that $s.\textit{build-order}(w.id, i)_r = m$.

Next we prove that $s.\textit{build-order}(w.id, i)_q = m$.

If $w$ is not established at $w.ldr$ in $s$, by Invariant 7.3.1 we have that
$s.\textit{build-order}(w.id, i)_q = s.\textit{build-order}(w.id, i)_r = m$.

If $w$ is established at $w.ldr$ in $s$, by Invariant 7.3.2 we have that
$s.\textit{build-order}(w.id, i)_{w.ldr} = s.\textit{build-order}(w.id, i)_r = m$. By the same invariant, since
$s.\textit{build-order}(w.id, i)_q \neq \bot$, we have that
$s.\textit{build-order}(w.id, i)_q = s.\textit{build-order}(w.id, i)_{w.ldr} = m$; hence
also in this case $s.\textit{build-order}(w.id, i)_q = m$.

Thus we have that $s.\textit{build-order}(w.id, i)_q = m$.

Since in state $s'$ process $q$ has already left configuration $w$, we have that after $s'$
process $q$ does not change $\textit{build-order}(w.id, i)_q$. Since $s.\textit{build-order}(w.id, i)_q = m$
we have that $s'.\textit{build-order}(w.id, i)_q = m$, as needed.


CASE 2: $q \in S_3$. This process arrives in configuration $c$ from a configuration $w'$ such
that $w.id < w'.id < c.id$. We need to prove that $s'.\textit{build-order}(w'.id, i)_q = m$.

By the inductive hypothesis we have that the statement is true in $s'$. By applying the
inductive hypothesis to state $s'$ with $c = w'$ we have that $s'.order(i)_q = m$. By definition
of $\textit{build-order}$ we have that $s'.\textit{build-order}(w'.id, i)_q = m$, as needed.


CLAIM 3: At least one process in $S_2 \cup S_3$ submits $m$ as $order(i)$ to the condenser function
$\phi_{rabstate}$ for configuration $c$.

PROOF OF CLAIM 3: By Invariant 7.1.1, a quorum $Q''' \in w.qrms$ is included in $c.set$.
Since $Q$ and $Q'''$ are quorums of $w$, there exists a process $r \in Q \cap Q'''$. Clearly $r \in c.set$
and also $r \in S_2 \cup S_3$. Since $r \in Q$ we have that $s.\textit{build-order}(w.id, i)_r = m$.

If $r$ arrives to configuration $c$ directly from $w$, then since $s.\textit{build-order}(w.id, i)_r = m$,
process $r$ submits $m$ to the condenser function $\phi_{rabstate}$ for configuration $c$.

Thus consider the case when $r$ arrives to configuration $c$ from $w'$, with $w.id < w'.id < c.id$.
Let $s'$ be the state in which process $r$ submits its state to the $\phi_{rabstate}$ function for
configuration $c$. By applying the inductive hypothesis to state $s'$ with $c = w'$ we have that
$s'.order(i)_r = m$. By definition of $\textit{build-order}$ we have that $s'.\textit{build-order}(w'.id, i)_r = m$.
Hence, also in this case, process $r$ submits $m$ to the condenser function $\phi_{rabstate}$ for
configuration $c$.


We are now ready to conclude the proof.

By Claim 1, we have that $S_2 \cup S_3$ contains at least one process. Thus, by definition of the
condenser function $\phi_{rabstate}$, we have that the states of processes in $S_1$ are ignored by $\phi_{rabstate}$. So
we only need to worry about what processes in $S_2 \cup S_3$ submit to the state condenser function for
configuration $c$. By Claim 2, we have that processes in $S_2$ and $S_3$ submit either $m$ or $\bot$ as the $i$-th
entry of $order$ to the state condenser function for configuration $c$. By Claim 3, at least one process
in $S_2 \cup S_3$ submits $m$.

Hence $\phi_{rabstate}$ computes an order $o$ for configuration $c$ such that $o(i) = m$.

Every process $p$ that establishes configuration $c$ sets $order_p := o$ when executing action
$\textsc{newstate}(\langle o, d, a \rangle)_p$ for configuration $c$. Thus, for such a process, we have that
$order(i)_p = m$ when it establishes configuration $c$. Within configuration $c$, process $p$ modifies
$order_p$ only when receiving a message from the leader. However the leader also has
$order(i)_{c.ldr} = m$. So $order(i)_p$ does not change.

Hence we conclude that if $p \in s.\textit{state-dlv}[c.id]$, then $s.order(i)_p = m$. $\Box$


The next invariant is similar to the previous one, but removes the requirement that there be no
totally established configurations between $w$ and $c$.


Invariant 7.3.4 In any reachable state the following is true. Let $w \in Est$ and $c \in Est$ be such that
$w.id < c.id$. Let $(i, m, w)$ be good. Then $order(i)_p = m$ for every $p \in \textit{state-dlv}[c.id]$.


Proof: Let $s$ be any reachable state and let $w$ and $c$ be as required by the statement. Let $(i, m, w)$
be good in $s$. Let $p \in \textit{state-dlv}[c.id]$. We need to prove that $s.order(i)_p = m$.

Let $x_1, x_2, \ldots, x_k$ be the sequence, in order of configuration identifier, of totally established
configurations between $w$ and $c$. Since $(i, m, w)$ is good, by Invariant 7.3.3 we have that $s.order(i)_q = m$
for any process $q$ such that $q \in \textit{state-dlv}[x_1.id]$. Configuration $x_1$ is totally established, hence for any
$q \in x_1.set$ we have $s.order(i)_q = m$. Hence we have $s.\textit{build-order}(x_1.id, i)_q = m$ for each member $q$ of
$x_1$.

It follows that $(i, m, x_1)$ is good. Thus by Invariant 7.3.3, used with $w = x_1$, we have that
$s.order(i)_q = m$ for any process $q$ such that $q \in \textit{state-dlv}[x_2.id]$. Configuration $x_2$ is totally
established, hence for any $q \in x_2.set$ we have $s.order(i)_q = m$. Hence we have
$s.\textit{build-order}(x_2.id, i)_q = m$ for each member $q$ of $x_2$.

It follows that $(i, m, x_2)$ is good. We can iterate this reasoning for $x_3, \ldots, x_k$ and obtain that
$s.order(i)_q = m$ for any process $q$ such that $q \in \textit{state-dlv}[c.id]$, as needed. $\Box$

Next we show that in a given state no two operations can be good for a particular index $i$.


Lemma 7.3.5 In any reachable state, given an index $i$, there exists at most one operation $m$ such
that $(i, m)$ is good.


Proof: Fix any reachable state $s$. By contradiction assume that there exist $m$ and $m'$ such that
$(i, m)$ and $(i, m')$ are good in $s$ and such that $m \neq m'$.

By definition of good we have that there exists at least one configuration $w'$ such that $(i, m, w')$
is good in $s$. Let $w_1$ be the configuration with the smallest identifier among the configurations $w'$
for which $(i, m, w')$ is good. Of course $(i, m, w_1)$ is good in $s$.

Similarly, by definition of good we have that there exists at least one configuration $w'$ such
that $(i, m', w')$ is good in $s$. Let $w_2$ be the configuration with the smallest identifier among the
configurations $w'$ for which $(i, m', w')$ is good. Of course $(i, m', w_2)$ is good in $s$.

Without loss of generality assume that $w_1.id < w_2.id$.

We now distinguish two possible cases.

CASE 1: There exists $c \in s.Est$ such that $w_2.id < c.id$. Fix $p \in s.\textit{state-dlv}[c.id]$. By
Invariant 7.3.4, applied with $w = w_1$, since $(i, m, w_1)$ is good, we have that $s.order(i)_p = m$.

By the same invariant, applied with $w = w_2$, since $(i, m', w_2)$ is good, we have that
$s.order(i)_p = m'$. This is a contradiction since $m \neq m'$.

CASE 2: There is no $c \in s.Est$ such that $w_2.id < c.id$. Since $(i, m', w_2)$ is good in $s$ we have that
there exists a quorum $Q \in w_2.qrms$ such that $s.\textit{build-order}(w_2.id, i)_q = m'$ for all members $q$ of $Q$. Fix
$p \in Q$. We have $s.\textit{build-order}(w_2.id, i)_p = m'$. Since there is no $c \in s.Est$ such that $w_2.id < c.id$, we
have that $w_2$ is the latest configuration established by $p$. Hence we have that $s.order(i)_p = m'$.

By Invariant 7.3.4, applied with $w = w_1$ and $c = w_2$, we have that $s.order(i)_p = m$. This is a
contradiction since $m \neq m'$. $\Box$


The following lemma generalizes the previous one by claiming that, even across an entire execution
and not just in a single state, we cannot have two different elements of $D$ being stable at the
same index.


Lemma 7.3.6 Let $\alpha$ be an execution. Let $s$ and $s'$ be two states of $\alpha$ and let $m, m' \in D$ be such
that $m \neq m'$. Then it cannot be that $(i, m)$ is good in $s$ and $(i, m')$ is good in $s'$.


Proof: By definition of good we have that if $(i, m)$ is good in a state $s$ then $(i, m)$ is good in any
subsequent state $s'$. Then the lemma follows easily by Lemma 7.3.5. $\Box$


The following lemma states that an index i for which an answer has been computed is good. 


Lemma 7.3.7 In any reachable state we have that if $\textit{req-answrd}(i)_p = true$ for some process $p$ then
$i$ is good.


Proof: Let $s$ be a reachable state and assume that $s.\textit{req-answrd}(i)_p = true$.

Variable $\textit{req-answrd}(i)_p$ is set to true by process $p$ either when providing the answer for
operation $i$ with action $\textsc{p2p-send}(\langle \text{``ans''}, a \rangle)_{p,q}$ or when establishing a new configuration
with action $\textsc{newstate}(\langle o, d, a \rangle)_p$.

In the former case process $p$ is the leader of some configuration $c$ in which the answer for operation
$i$ is computed. Process $p$ computes such an answer only after informing a quorum $Q \in c.qrms$, by
means of the underlying DLC service, about $order(i)_p$. Hence $i$ is good in $s$.

In the latter case process $p$ sets $\textit{req-answrd}_p := a$. So $\textit{req-answrd}(i)_p$ is true only if $a(i)$ is
true. But $a(i)$ is true only if the leader $p'$ of some previous configuration has computed an answer
for operation $i$ and thus has executed action $\textsc{p2p-send}(\langle \text{``ans''}, a \rangle)$ for $i$. Let $s'$ be the state when $p'$
computed the answer for operation $i$. Clearly $i$ is good in $s'$. But once $i$ is good in a state it stays
good in all subsequent states. Hence $i$ is good in $s$. $\Box$


In order to prove that the system implements an atomic object we use the following lemma, which
is a version of Lemma 13.16 of [65] (page 435) that considers general operations instead of specific
read/write operations.


Lemma 7.3.8 Let $\beta$ be a (finite or infinite) sequence of actions of an atomic object external
interface. Suppose that $\beta$ is well-formed, and contains no incomplete operations. Let $\Pi$ be the set of all
operations in $\beta$. Suppose that $\prec$ is an irreflexive total ordering of the operations in $\Pi$, satisfying the
following properties:

1. For any operation $A \in \Pi$, there are only finitely many operations $B$ such that $B \prec A$.

2. If the response event for operation $A$ precedes the invocation event for operation $B$ in $\beta$, then
$A \prec B$.

3. The response for any operation $A \in \Pi$ is the result of applying all the operations that precede
$A$, including $A$ itself, in the order $\prec$.

Then $\beta$ satisfies the atomicity property.


Finally we can give the following claim.

Claim 7.3.9 The system RAB implements an atomic object.

Proof: By Lemma 13.10 of [65] (page 419) we can restrict our attention to executions with only
complete operations. Fix such an execution $\alpha$. Remember that $\Pi$ is the set of operations of $\alpha$.

In order to show that the system implements an atomic object we need to provide a total order $\prec$
on $\Pi$ that satisfies Lemma 7.3.8. Let us define the order $\prec$ as follows.

Let $A$ be an operation in $\Pi$. By definition of $\Pi$ we have that $A$ gets completed. In order for $A$
to complete there must be a leader that stores $A$ in some position $i$ and computes an answer for $A$
setting $\textit{req-answrd}(i)$ to true. So there exists a state $s$ of $\alpha$ such that $s.\textit{req-answrd}(i)_p = true$ for
some $p$. By Lemma 7.3.7 we have that $(i, A)$ is good in $s$. Denote by $tag(A)$ the index $i$. This is
well defined because of Lemma 7.3.6. Note that no two operations $A$ and $B$ can get the same tag $i$,
because we would have that both $(i, A)$ and $(i, B)$ are good in some state, contradicting Lemma 7.3.6.

Order all operations in $\Pi$ in order of tag, that is, $A \prec B$ if and only if $tag(A) < tag(B)$.

Next we prove that $\prec$ satisfies the hypothesis of Lemma 7.3.8.

Let us start with Point 1. Fix $A \in \Pi$. Any operation $B \prec A$ must have $tag(B) < tag(A)$. The
number of such operations is bounded by $tag(A)$.

Now consider Point 2. Fix $A, B \in \Pi$ and assume that the response event for operation $A$ precedes
the invocation event for operation $B$.

Since $A, B \in \Pi$ there exist a state $s$ of $\alpha$ and two indexes $i, j$ such that $(i, A)$ and $(j, B)$ are
good in $s$. By definition of tag we have that $i = tag(A)$ and $j = tag(B)$. Hence we need to prove
that $i < j$.

Let $s'$ be the state when $B$ gets invoked. Since the response event of $A$ precedes the invocation
event of $B$, we have that the response event of $A$ precedes $s'$.

Since $A$ received a response before state $s'$, we have that there exists a state $s''$ preceding $s'$ such
that $s''.\textit{req-answrd}(i)_p = true$ for some process$^2$ $p$. By the code (see action
$\textsc{p2p-send}(\langle \text{``ans''}, a \rangle)_{p,q}$), we have that $s''.\textit{req-answrd}(k)_p = true$ for any $k \leq i$. By
Lemma 7.3.7 we have that all indexes $k \leq i$ are good in $s''$. Since $s''$ precedes $s'$, indexes $k \leq i$ are
good in $s'$ too.

This implies that for each index $k \leq i$ there exists an operation $A_k$ such that $(k, A_k)$ is good in
$s'$. Moreover none of these operations $A_k$ can be equal to $B$ because $B$ is invoked in state $s'$ and
thus cannot be good in state $s'$.

Remember that $(j, B)$ is good in $s$. By Lemma 7.3.6 it cannot be that $j \leq i$ because for any
index $k \leq i$ there exists an operation $A_k \neq B$ such that $(k, A_k)$ is good in $s'$.

Hence it must be that $i < j$, as needed to prove Point 2.

Finally consider Point 3. This condition is true because responses are given in order of tag and
by Lemma 7.3.6 this order is consistent for all operations in $\Pi$. $\Box$
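For a finite sequence of completed operations, the conditions of Lemma 7.3.8 under the tag order can be checked mechanically. The sketch below is an illustration only: events are identified by integer positions in the sequence, each operation carries a function from the object state to a pair (new state, response), and all names (`Op`, `respects_real_time`, `responses_match_sequential_run`) are hypothetical. Point 1 holds trivially for finite histories and is not checked.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass(frozen=True)
class Op:
    name: str
    tag: int            # index assigned to the operation, tag(A) in the proof
    invoked: int        # position of the invocation event in the sequence
    responded: int      # position of the response event in the sequence
    apply: Callable[[Any], Tuple[Any, Any]]  # state -> (new state, response)
    response: Any       # response actually returned to the client


def tags_are_distinct(ops: List[Op]) -> bool:
    return len({op.tag for op in ops}) == len(ops)


def respects_real_time(ops: List[Op]) -> bool:
    """Point 2: if A responds before B is invoked, then tag(A) < tag(B)."""
    return all(a.tag < b.tag for a in ops for b in ops if a.responded < b.invoked)


def responses_match_sequential_run(ops: List[Op], initial_state: Any) -> bool:
    """Point 3: each response equals the one obtained by replaying in tag order."""
    state = initial_state
    for op in sorted(ops, key=lambda o: o.tag):
        state, expected = op.apply(state)
        if expected != op.response:
            return False
    return True


def write(value: Any) -> Callable[[Any], Tuple[Any, Any]]:
    return lambda state: (value, "ack")


def read(state: Any) -> Tuple[Any, Any]:
    return (state, state)


# A read/write register history; the read overlaps the first write.
history = [
    Op("W1", tag=1, invoked=0, responded=3, apply=write(1), response="ack"),
    Op("R", tag=2, invoked=1, responded=5, apply=read, response=1),
    Op("W2", tag=3, invoked=6, responded=7, apply=write(2), response="ack"),
]

print(tags_are_distinct(history))                   # True
print(respects_real_time(history))                  # True
print(responses_match_sequential_run(history, 0))   # True
```

A history in which the read returned the initial value 0 while keeping tag 2 would fail the third check, since replaying in tag order applies the write of 1 first.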


7.4 Remarks 


The RAB algorithm implements an atomic shared object. As a particular case we may have a read- 
write register. In Chapter 6 the algorithm ABD also implements an atomic shared read-write register. 
The latter is built upon the DC service while the former is built upon the DLC service which is a 
variation of the DC service that defines a leader within each configuration. The two algorithms are 
similar (indeed they use similar services as building blocks): both rely on spreading each operation
to a quorum of processes in order to keep data consistency. The main difference is that the RAB 
algorithm uses the leader of the current configuration in order to centralize the handling of the 
requests from the clients; within each configuration, the leader of the configuration is responsible for 
providing answers to the requests. With such an approach only the leader needs to have the most up 
to date information and thus, when the system is stable, this approach is more efficient. In the ABD 
algorithm at any time there is a quorum of processes which have the most up to date information. 


$^2$ This process $p$ is the leader of the configuration in which the answer to operation $A$ is computed. However this
is not important for this proof.

Like the Liskov-Oki algorithm [76], the RAB algorithm uses a centralized approach where a distinguished
process is responsible for performing requested operations; however, such a process needs the
cooperation of a quorum of other processes in order to provide answers to the requested operations.
Our algorithm is dynamic and allows changes in the universe of processes, while the Liskov-Oki
algorithm assumes a fixed universe of processes. The RAB algorithm uses a more conservative approach
in providing answers to requested operations: the leader does not respond to a requested
operation until it knows that a quorum of processes have recorded that operation. In the Liskov-Oki
approach the leader immediately responds to requested operations; this is more efficient when there
are no failures, but in case of failures it is less efficient (roll back might be necessary).
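The conservative rule described above (answer only once a quorum has recorded the operation) can be illustrated with a toy leader. This is a minimal sketch, not the RAB code: the replicated object is a counter, a quorum is any set of at least `quorum_size` replicas, communication is a direct method call, and the class names `Replica` and `Leader` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Operation = Tuple[str, int]  # e.g. ("add", 5)


class Replica:
    def __init__(self, name: str, up: bool = True) -> None:
        self.name = name
        self.up = up
        self.log: Dict[int, Operation] = {}

    def record(self, index: int, op: Operation) -> bool:
        """Store op at the given index; a crashed replica does not acknowledge."""
        if not self.up:
            return False
        self.log[index] = op
        return True


class Leader:
    def __init__(self, replicas: List[Replica], quorum_size: int) -> None:
        self.replicas = replicas
        self.quorum_size = quorum_size
        self.next_index = 0
        self.total = 0  # the replicated object: a simple counter

    def handle(self, op: Operation) -> Optional[int]:
        """Answer only after a quorum has recorded op; otherwise give no answer."""
        index = self.next_index
        acks = sum(1 for replica in self.replicas if replica.record(index, op))
        if acks < self.quorum_size:
            # No answer is produced. A real protocol would wait or move to a
            # new configuration; this sketch simply gives up on the request.
            return None
        self.next_index += 1
        kind, amount = op
        if kind == "add":
            self.total += amount
        return self.total


replicas = [Replica("a"), Replica("b"), Replica("c", up=False)]
leader = Leader(replicas, quorum_size=2)
print(leader.handle(("add", 5)))   # 5: replicas a and b recorded the operation
replicas[1].up = False
print(leader.handle(("add", 1)))   # None: only one replica can record it
```

Because the answer is withheld until a quorum holds the operation, a later leader that contacts any quorum finds every answered operation, so no answered operation has to be rolled back.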

The MULTIPAXOS algorithm [61] can also be used to implement a replicated atomic object. Indeed 
processors can agree on the sequence of operations to perform on the shared object by running a 
sequence of instances of a consensus algorithm. The usefulness of developing the RAB algorithm is 
that we use a building block which provides a powerful service and thus much of the computation 
that needs to be done is delegated to the DLC building block. This results in a simpler algorithm. 
The overall algorithm is similar to an algorithm that would use the MULTIPAXOS approach; however 
the philosophy underlying the building blocks approach is that building blocks are built once and 
then can be used by many applications which can take advantage of powerful properties offered by 
the building block. With such a perspective designing RAB is easier than designing a replicated 
atomic object based on MULTIPAXOS (the interested reader can compare the code of MULTIPAXOS
provided in [29] with the code of RAB).


Chapter 8


Conclusions 


In this thesis we have provided a set of group communication specifications. We have also given 
implementations of the specifications and we have constructed applications on top of the specifica- 
tions. 

The main theme has been that of providing “dynamic” group communication specifications, 
that is, specifications for group communication services that adapt well to dynamic changes of the 
underlying distributed system. This is crucial in systems where processes can join or leave the 
system routinely because of process or link failures. 

In such settings, it is possible that the underlying system suffers partitions. In the presence 
of partitions two approaches can be followed: one is to allow every component of the partition to 
proceed independently; another one is to select a unique “primary” component of the partition and 
allow progress only in that component. The former approach improves availability at the expense 
of shared data coherence. The latter is to be used when replicated data needs to be maintained 
coherently. Most group communication services and specifications take the first approach: they are 
partitionable. 

When applications require a primary component but run over a partitionable group commu- 
nication service, it is the responsibility of the application to figure out whether it is in a primary 
component or not. Establishing whether the current component is primary or not is clearly indepen- 
dent of the particular application. Thus it would be better to move this problem from the application 
to a lower level layer. One possibility is to use a primary component group communication service
as a building block.

In Chapter 5 we have considered the extension of existing partitionable group communication
services to primary ones. We have provided a specification for a dynamic primary view group
communication service called DVS. The communication tools provided by DVS are those typical of a
group communication service; the membership service provides the client with primary views.

We have also shown that the DVS service is implementable. Our implementation is based on the VS
service of Fekete, Lynch and Shvartsman [41] and uses ideas from the dynamic membership algorithm
of Yeger Lotem, Keidar and Dolev [89]. The implementation filters the views provided by the VS
service in order to establish whether the system has partitioned and, in such a case, to report to the
clients only views satisfying particular intersection properties with previous views. Such views are
the primary components of the partition. By reporting to the clients only these primary views, the
service enforces that computation proceeds only in the primary component.
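The filtering idea can be sketched as follows. The exact intersection properties used by the implementation are defined in Chapter 5 and are not reproduced here; the rule below (a view is reported only if it contains a strict majority of the last reported primary view) is a simplified stand-in chosen for illustration, and the class name `PrimaryViewFilter` is hypothetical.

```python
from typing import FrozenSet, Iterable, Optional


class PrimaryViewFilter:
    """Forwards a view to the client only if it qualifies as primary.

    Assumption of this sketch: a view qualifies when it contains a strict
    majority of the members of the last view that was reported as primary.
    """

    def __init__(self, initial_view: Iterable[str]) -> None:
        self.current_primary: FrozenSet[str] = frozenset(initial_view)

    def on_view(self, members: Iterable[str]) -> Optional[FrozenSet[str]]:
        view = frozenset(members)
        overlap = len(view & self.current_primary)
        if 2 * overlap > len(self.current_primary):
            self.current_primary = view
            return view          # reported to the client as a primary view
        return None              # filtered out: the client never sees this view


view_filter = PrimaryViewFilter({"a", "b", "c", "d", "e"})
print(view_filter.on_view({"a", "b", "c"}))   # reported: 3 of 5 previous members
print(view_filter.on_view({"d", "e"}))        # None: no overlap with {a, b, c}
print(view_filter.on_view({"a", "b", "f"}))   # reported: 2 of 3 previous members
```

With this rule two disjoint components can never both be reported after the same primary view, which is the effect the filter is meant to achieve.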

In order to show the usefulness of the DVS specification we have developed an application on top
of it. The application we have developed provides a totally ordered broadcast service: clients are
allowed to broadcast messages to all other members of the system and the service guarantees that
each member of the system receives the messages in the same order. This is a very powerful service
for developing replicated data algorithms or any other application that requires data coherence.


In Chapter 6 we tackled the problem of extending dynamic primary view services to dynamic 
“configuration” services. A configuration is a view with a quorum system defined on the membership 
set of the view. The use of quorums is a well-known technique to improve availability and efficiency 
in a distributed system. With quorums usually a client request is serviced by a quorum of the set 
of all the members of the system (as opposed to the whole set of members). Our goal has been that 
of integrating the use of quorums in a group communication system. In particular we extended the 
DVS service to handle configurations. The result has been a specification for a primary configuration 
group communication service, called DC. The main difficulty in developing DC has been that of
defining a dynamic primary configuration. The notion of dynamic primary view has been well 
studied (e.g., [55, 89]). As far as we know, there was no corresponding notion for configurations. 
We have developed such a notion and used it to specify the DC service. 

As for the DVS service, we have proved that DC is implementable and useful. The implementation is 
very similar to that of DVS. It uses a static service internally, which provides any new configurations.
Then it filters these configurations to find those that satisfy certain intersection properties. These 
configurations are the primary configurations which are given to the clients of the service. 

The application we have developed on top of DC is a multi-writer/multi-reader atomic register. 
This application is based on the work of Attiya, Bar-Noy and Dolev [12] and that of Lynch and
Shvartsman [66]. The algorithm exploits the quorum-oriented framework provided by the DC service.


Finally in Chapter 7 we have explored the application of the techniques deployed in the development
of DVS and DC to the design of dynamic algorithms, i.e., algorithms that work well in dynamic
distributed settings.

Lamport's PAXOS algorithm uses quorums to solve the consensus problem; however it is designed for
a static setting. We have used the DC service$^1$ in order to design a dynamic version of the PAXOS
algorithm, a version that adapts well to system changes, even permanent ones.

We have also provided a dynamic primary copy data replication algorithm. Like the DPAXOS
algorithm, this algorithm also uses (a variant of) the DC service as a building block. We have sketched
the proof of correctness of this algorithm (the formal proof is left as future work).


Applications developed on top of powerful building blocks are easier to build than those built 
from scratch, because such applications can benefit from the guarantees provided by the building 
blocks. We have shown that our group communication building blocks are powerful enough to 
build interesting applications. We think that other applications can be built on top of the group
communication services (or variations of them) provided in this thesis.


An interesting feature of the DC specification is that it integrates a state exchange mechanism 
within the service. When a new configuration is delivered to the client, the client is supposed to 
submit its current state to the service. Once the service receives the state from all the members of 
the configuration, it computes a new up-to-date state and delivers this state to each member of the 
configuration. In this way the state transfer is relegated to the DC service. 

The DC service requires all members of the configuration to submit the state before computing a new 
up-to-date state. It would be interesting to explore the possibility of computing the new up-to-date 
state when only a quorum of the processes have submitted their state. Clearly the resulting service 
would be weaker, but it is possible that useful applications can still be constructed on top of this
weaker service. The advantage would be a more available service.
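The difference between the two variants of the state exchange can be sketched in a few lines. This is an illustration under assumptions, not the DC specification: the function that computes the up-to-date state is passed in as `condense`, a quorum is approximated by a number of submissions (`required`), and the class name `StateExchange` is hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional


class StateExchange:
    """Collects the states submitted for one configuration.

    With `required` left unset, every member must submit before a new state
    is computed. Setting `required` to a smaller number models the weaker,
    quorum-based variant discussed above.
    """

    def __init__(
        self,
        members: Iterable[str],
        condense: Callable[[List[Any]], Any],
        required: Optional[int] = None,
    ) -> None:
        self.members = frozenset(members)
        self.condense = condense
        self.required = len(self.members) if required is None else required
        self.submitted: Dict[str, Any] = {}

    def submit(self, process: str, state: Any) -> Optional[Any]:
        """Record a state; return the condensed state once enough have arrived."""
        if process not in self.members:
            raise ValueError(f"{process} is not a member of this configuration")
        self.submitted[process] = state
        if len(self.submitted) >= self.required:
            return self.condense(list(self.submitted.values()))
        return None


# States are (version, value) pairs; the most recent version wins.
strict = StateExchange({"p1", "p2", "p3"}, condense=max)
print(strict.submit("p1", (1, "a")))   # None
print(strict.submit("p2", (3, "c")))   # None
print(strict.submit("p3", (2, "b")))   # (3, 'c')

relaxed = StateExchange({"p1", "p2", "p3"}, condense=max, required=2)
print(relaxed.submit("p1", (1, "a")))  # None
print(relaxed.submit("p2", (3, "c")))  # (3, 'c'), without waiting for p3
```

The relaxed variant answers sooner, but its result may miss the state of a slow member, which is why the resulting service is weaker.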


One of the major goals of the current research in this area is to provide simple, universally accepted
specifications that describe the semantics of the existing group communication services already in
use in real-world applications. Probably it is not possible to give a unique specification good for all
applications: different applications will require different group communication services, which will
be tailored to the applications. Another approach consists of defining independent protocol layers
that implement different service levels and semantics (as is done, for example, in Horus and
Ensemble). The application developer can use any combination of these layers, building the right
semantics for the needed group communication service.


Though much work has been focused on providing specifications for group communication services
(we refer the reader to Chapter 2 for pointers to the literature or to [87] for a survey), the overall goal
has not been achieved yet. It would be good to provide a set of universally accepted specifications
for group communication services that cover all possible needs of applications. System implementors
could then concentrate on efficient implementations of such specifications and application developers
could build their applications on top of the guarantees provided by the specifications.

$^1$ We actually have used a variant of the DC service tailored to the particular application that we have
developed. See Chapter 7 for more details.

In this perspective, the contribution of this thesis is that of having provided formal specifications 
for two particular group communication services tailored to applications that run in dynamic systems
and that require primary views and configurations.


Possible future work that follows the direction of this thesis may include the following. 

One could provide performance and fault-tolerance analysis of the algorithms presented in the
thesis. We have focused our attention on the safety properties; though our algorithms are not
naive$^2$, we have not proved any performance property. Such a performance analysis could be based
on assumptions on the underlying physical distributed system (as is done in [41]). Moreover, since
we were concerned only with safety, our algorithms are not tuned for optimal performance.

Hence one could optimize the algorithms presented in the thesis and compare them with other 
algorithms. In particular it would be interesting to provide a dynamic version of the MULTIPAXOS 
algorithm built upon the DLC group communication service. In Chapter 7 we have provided a 
dynamic version of the PAXOS algorithm but not a dynamic version of the MULTIPAXOS algorithm. 
One could provide such a dynamic version of MULTIPAXOS and compare its performance with that of 
the original MULTIPAXOS algorithm (see [61, 29]). We have provided two algorithms that implement 
atomic objects: the ABD-SYS algorithm, which is based on a decentralized approach, and the RAB 
algorithm, which is based on a centralized approach. We have not tuned the code of these algorithms 
for efficiency. It would be interesting to provide optimized code and compare the performance of 
these algorithms with others that solve similar problems (e.g., the Liskov-Oki algorithm [76]). 

Another possibility is to provide variations of the group communication services presented in 
this thesis. An interesting one is that of weakening the DC specification presented in Chapter 6 in 
order to allow the computation of the starting state for a new configuration as soon as the processes 
in a quorum of the configuration have submitted their state (the version we have used requires all 
the members of the configuration to submit their state). We believe that this weaker version is still
powerful enough to build useful applications.


More generally it would be interesting to build other algorithms on top of the group communication
service building blocks provided in this thesis and also to provide new building blocks tailored
to other applications.


$^2$ An algorithm that does nothing is a safe algorithm.


Part II


Distributed k-set Consensus


Chapter 9


Distributed k-set Consensus: Overview


The problem of reaching consensus in a distributed system arises in many forms and various contexts,
such as distributed data replication, distributed databases, and flight control systems. Data
replication is used in practice to provide high availability: having more than one copy of the data 
allows easier access to the data, i.e., the nearest copy of the data can be used. However, consistency 
among the copies must be maintained. A consensus algorithm can be used to maintain consistency. 
Another practical example of the use of data replication is an airline reservation system. The data 
consists of the current booking information for the flights and it can be replicated at agencies spread 
over the world. The current booking information can be accessed at any of the replicas. Reservations 
or cancellations must be agreed upon by all the copies. 

Distributed consensus has been extensively studied; a good survey of early results is provided in
[42]. We refer the reader to [65] for a more up-to-date treatment of consensus problems.


One of the most celebrated results about distributed consensus is the impossibility result of
Fischer, Lynch and Paterson [43]. This impossibility result, popularly known as FLP, states that
it is impossible to achieve distributed consensus in asynchronous systems even if only one stop
failure is possible. This surprising result sparked various directions of research aimed at solving the
problem by either restricting the asynchrony of the computation model [31, 35], or using randomized
protocols [18, 21, 80], or weakening the problem definition [24, 32, 39, 40].

The last of these three directions of research falls in the more general research area of demarcat- 
ing what is deterministically computable and what is deterministically impossible in asynchronous 
distributed systems in the presence of failures. The FLP impossibility seemed to suggest that no 
nontrivial problem could be solved deterministically and asynchronously in the presence of faults. 


Attiya, Bar-Noy, Dolev, Peleg and Reischuk [13] showed that the renaming problem can be solved in
a deterministic way in asynchronous systems in the presence of failures. Informally, in the renaming
problem processors start the computation with a "name" taken from some unbounded ordered name
space and have to "rename" themselves with names chosen from a new small name space. This result
revived the research trend of exploring what is computable and what is impossible in deterministic asynchronous
distributed systems subject to failures. Following this direction, Chaudhuri [24] defined the k-set 
consensus (or k-consensus for short) problem, which is a natural generalization of the consensus 
problem obtained by allowing processes to decide on k different values, instead of requiring them to 
agree on a single value. The 1-consensus problem is the classical consensus problem. 

Chaudhuri provided an algorithm to solve the k-consensus problem that tolerates up to a threshold
t of process failures strictly smaller than k. This result proved that the k-consensus problem,
for k ≥ 2, allows more resilience than the 1-consensus problem. Chaudhuri conjectured that the
k-consensus problem was impossible to solve while tolerating k or more failures. This conjecture was
proven true by three independent research teams: Borowsky and Gafni [20], Herlihy and Shavit [52] 
and Saks and Zaharoglou [84]. Attiya [11] provided an alternative proof of the same result. 

The results of [24, 20, 52, 84] completely characterize the k-consensus problem in asynchronous 
systems with stop failures. In such a model the k-consensus problem is solvable if and only if t < k. 

The formal definition of the k-consensus problem requires three conditions to be satisfied: agreement,
termination and validity. The agreement condition requires that each process decide on a
value in such a way that the set of decided values has cardinality at most k. The termination condition
simply requires that each (correct) process decide. As for the validity condition,
several variants have been considered in the literature. The validity condition used in [24, 20, 52, 84]
requires that each decision be equal to some input value.

An alternative definition of the validity condition considered for the 1-consensus problem with
stop failures requires that if all the inputs to the processes of the system are equal then any decision
must be equal to the input (see, for example, [65, Ch. 6]). This condition is the one considered for
the k-consensus problem.

In a Byzantine environment faulty processes can “mask” their inputs. Hence a more suitable
validity condition considered for the 1-consensus problem with Byzantine failures requires that if all
the correct processes have the same input then any decision be the input of a correct process [62, 78].


In this thesis we explore the solvability of the k-set consensus problem in asynchronous message 
passing models in which processes fail by stopping or fail arbitrarily (Byzantine failures). The main 
theme is that the validity condition has a profound impact on when the problem is solvable. We 
consider six different validity conditions and use these conditions to demarcate when k-set consensus 
is solvable for each system model. In several cases we completely characterize solvability. In some 
we characterize solvability with very little uncertainty (i.e., a small gap between computable and
impossible) and in a few cases we leave a substantial gap.




In more detail, we start from the validity condition used by Chaudhuri, which we call the
“regular” validity condition (the decision of any correct process is the input of some process) and
denote by RV1, and consider a weakened version (if there are no failures then every decision is the
input of some process), denoted by WV1, and a strengthened version (the decision of any correct
process is the input of some correct process), denoted by SV1. For each of these three validity
conditions we consider a corresponding weakened version obtained by requiring the condition only
if all the processes start with the same input value. We denote these validity conditions by SV2
(if all correct processes start with v then correct processes decide v), RV2 (if all processes start with
v then correct processes decide v) and WV2 (if there are no failures and all processes start with v,
then any decision is equal to v).

For the crash failures model we completely characterize the line that separates possible from
impossible for each of the above six validity conditions, with the exception of validity SV2 where a
tiny gap is left open.

For the Byzantine failures model we characterize the line that separates possible from impossible,
leaving a tiny gap for SV2, RV2 and WV2, and a substantial gap for WV1.


The rest of this part is structured as follows. In Chapter 10 we give the formal definition of
the k-consensus problem. We study the k-set consensus problem for crash failures in Chapter 11.
Chapter 12 presents the results for Byzantine failures.


Chapter 10


The problem 


We consider a distributed system consisting of n processes denoted by p_1, p_2, ..., p_n. A process that
follows its algorithmic specification throughout an execution is said to be correct, and a process
that departs from its specification is said to be faulty. In a fail-stop model (also known as a crash
model), faulty processes are allowed to prematurely halt execution only. In a Byzantine model, a
faulty process can deviate from its specification arbitrarily. We assume that at most t processes fail,
where t ≥ 1 is a known, positive integer.

We assume that the system is asynchronous. Processes communicate by sending messages over
a network. We are not concerned with the particular topology of the network. Since we consider
asynchronous systems, messages can take an arbitrarily long time to reach their destination. We only
assume that the network of processes is connected, that is, any process can send a message to any
other process. Moreover, messages are not altered, lost or duplicated while in transit on the network.


For any k, 1 ≤ k ≤ n, we denote a k-set consensus problem by KC(k) or simply KC when k is
not relevant. The KC(k) problem is defined as follows. Each process p_i starts the computation with
an input value v_i. Each correct process has to irreversibly “decide” on a value in such a way that
three conditions, called termination, agreement and validity, hold. These conditions are:

Termination: Every correct process eventually decides.

Agreement: The set of values decided by correct processes has size at most k.

Validity: One of the following conditions.

SV1 (strong v1): The decision of any correct process is equal to the input of some
correct process.

SV2 (strong v2): If all correct processes start with v then correct processes decide v.

RV1 (regular v1): The decision of any correct process is equal to the input of some
process.

RV2 (regular v2): If all processes start with v then correct processes decide v.

WV1 (weak v1): If there are no failures, then the decision of any process is equal to
the input of some process.

WV2 (weak v2): If there are no failures and all processes start with v, then the
decision of any process is equal to v.
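
The six validity conditions can be made concrete with a small checker that inspects one finished run. This is only an illustrative sketch: the function name `check_run`, the dictionary-based representation of a run, and the string labels for the conditions are assumptions made here and are not part of the thesis.

```python
def check_run(inputs, decisions, correct, k, validity):
    """Check one finished run against termination, agreement and a validity condition.

    inputs:    dict mapping every process to its input value
    decisions: dict mapping each process that decided to its decision
    correct:   set of processes that are correct in this run
    validity:  one of "SV1", "SV2", "RV1", "RV2", "WV1", "WV2"
    """
    by_correct = {decisions[p] for p in correct if p in decisions}
    by_anyone = set(decisions.values())
    all_inputs = set(inputs.values())
    correct_inputs = {inputs[p] for p in correct}
    no_failures = set(correct) == set(inputs)

    if validity == "SV1":
        valid = by_correct <= correct_inputs
    elif validity == "SV2":
        # Only binding when all correct processes share one input.
        valid = len(correct_inputs) != 1 or by_correct <= correct_inputs
    elif validity == "RV1":
        valid = by_correct <= all_inputs
    elif validity == "RV2":
        # Only binding when all processes share one input.
        valid = len(all_inputs) != 1 or by_correct <= all_inputs
    elif validity == "WV1":
        # Only binding in failure-free runs.
        valid = not no_failures or by_anyone <= all_inputs
    elif validity == "WV2":
        valid = not (no_failures and len(all_inputs) == 1) or by_anyone <= all_inputs
    else:
        raise ValueError(f"unknown validity condition: {validity}")

    return {
        "termination": all(p in decisions for p in correct),
        "agreement": len(by_correct) <= k,
        "validity": valid,
    }
```

For example, a run in which two correct processes both decide the input of a faulty third process satisfies RV1 but not SV1 under this checker.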


Given a validity condition C, we denote by KC(k,C) the KC(k) problem defined with validity C.
We also use the notation KC(C) if k is not relevant. We use the notation KC(k,t) to denote a KC(k)
consensus problem with at most t failures allowed. The notation KC(k,t,C) denotes KC(k,t) with
validity C.

We define a partial order on the KC problems based on the strength of the validity conditions. 
We say that KC(C) is weaker than KC(D) if any algorithm for solving KC(D) can be used to solve 
KC(C) in a given model. Clearly KC(C) is weaker than KC(D) if any impossibility result that holds 
for KC(C) holds also for KC(D). Conversely, we say that KC(C) is stronger than KC(D) if KC(D) is
weaker than KC(C). Figure 10-1 shows the “weaker than” relation between the validity conditions.

Figure 10-1: Validity conditions. An arrow from a validity condition C to a validity condition D means
that KC(C) is weaker than KC(D) (and that KC(D) is stronger than KC(C)).


KC(k,RV1) is the consensus problem as considered by Chaudhuri [24]. KC(1,RV1) and KC(1,RV2)
are classical consensus problems (see, e.g., [65, Ch. 6]). KC(1,SV2) has been considered in the
Byzantine setting [62, 78]. KC(1,WV2) is weak Byzantine agreement [60].

It is well known that the case k = 1 cannot be solved for any nontrivial validity condition and,
in particular, for any of the validity conditions that we consider here, for any t ≥ 1 [43]. On the
other hand, if k = n then KC(k) is trivially solvable (each process decides its own value), even in
the Byzantine setting, for any t and with the strongest validity condition we are considering, that
is, validity SV1. Thus, we will henceforth be concerned only with the cases 2 ≤ k ≤ n - 1. Since the
problem is easily solvable for t = 0 we also assume that t ≥ 1.


Chapter 11


Crash failures 


In this section we consider the crash failures model. In Section 11.1 we recall known results. In
Sections 11.2 and 11.3, we provide further impossibility results and protocols, respectively. Figure 11-1
shows a graphical representation of the results provided in this section.


11.1 Known results 


As noted in Section 9, for the crash failure model we already know the line between computable
and impossible for KC(k,t,RV1):

Lemma 11.1.1 ([24]) In the crash model, there is a protocol for KC(k,t,RV1), for t < k.

Lemma 11.1.2 ([20, 52, 84]) In the crash model, there is no protocol for KC(k,t,RV1), for t ≥ k.

By Lemma 11.1.1, we have that KC(k,t,RV2), KC(k,t,WV1) and KC(k,t,WV2) are solvable for
t < k because RV2, WV1 and WV2 are weaker than RV1. By Lemma 11.1.2, KC(k,t,SV1) cannot be
solved for t ≥ k because SV1 is stronger than RV1.


11.2 Impossibilities 


In this section we provide impossibility results for the crash model. An ingredient in most of our
impossibility results is the fact that in any protocol tolerating t failures, a process must be able
to decide after communicating with at most n - t processes (including itself). Indeed, if a process
waited to communicate with more than n - t processes, termination could not be achieved: the runs
in which there were exactly t faulty processes that do not send any messages would not terminate.


Lemma 11.2.1 In the crash model, there is no protocol for KC(k,t,WV2), for t ≥ ((k - 1)n + 1)/k.

Figure 11-1: Crash model. Regions filled in brick pattern indicate impossibility. Regions filled in honeycomb
pattern indicate solvability. Unfilled regions indicate open problems. Figures are drawn to scale n = 64.


Proof: For a contradiction, assume that such a protocol A exists. In the rest of the proof we use
the notation KC_P(k,t,C) to explicitly state the set P of processes among which k-set consensus is
to be solved. Denoting by P the set of all processes, we have that A solves KC_P(k,t,WV2).

Since t ≥ ((k - 1)n + 1)/k implies n ≥ k(n - t) + 1, we can partition the n processes into k
groups g_1, g_2, ..., g_k of disjoint processes with g_1, ..., g_{k-1} containing exactly n - t processes and g_k
containing at least n - t + 1 processes. If t = n we let g_1, g_2, ..., g_{k-1} be singleton sets of processes
and we let g_k contain at least two processes (this is possible because we only consider k < n).

First we claim that there is a run of A where only processes in g_k take steps and such that two
values are decided. To see why, assume that all the runs involving only processes of g_k are such
that only one value is decided. Then we could use A to solve KC_{g_k}(1,1,WV2): g_k contains at least
n - t + 1 processes, so that even if one of them is faulty we still have at least n - t correct processes
in g_k and hence the protocol has to terminate. However, this contradicts [43], since no such protocol
exists. Hence there is a run a_k in which only processes in g_k take steps and they decide on at least
two different values, say v_k, v_{k+1}. Let v_1, ..., v_{k-1} be k - 1 values different from v_k, v_{k+1}.

Fix i ∈ {1, 2, ..., k - 1} and consider the following run a_i: all processes are correct, all start
with v_i and all messages sent to processes in g_j, j = 1, 2, ..., k, by processes not in g_j are delayed
until all processes in g_j make a decision. We can use A to solve KC_P(k,t,WV2) and by validity WV2
we have that all processes, in particular those in group g_i, decide v_i.

Now consider the following run a. All processes are correct, for each i, i = 1, 2, ..., k - 1, each
process in g_i starts with v_i and processes in g_k start with the same values they start with in a_k. Moreover
for each i, i = 1, 2, ..., k, all messages sent to processes in group g_i by processes not in g_i are delayed
until all processes in g_i have decided. We can use A to solve KC_P(k,t,WV2) in a. However, for each
i, i = 1, 2, ..., k, processes in g_i cannot distinguish between run a_i and run a. Indeed in both runs
they only communicate with processes in g_i before making a decision and in both runs processes in
g_i start with the same value. Since, for i = 1, 2, ..., k - 1, in run a_i processes in g_i decide v_i, they
must decide v_i also in a. Since in run a_k processes in g_k decide on v_k and v_{k+1}, they must decide v_k
and v_{k+1} also in a. Hence we have that k + 1 values are decided in a. Thus the agreement condition
is violated and this contradicts the hypothesis that A solves KC_P(k,t,WV2). □
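
The partition used in this proof can be illustrated with a short sketch. It assumes the threshold reads t ≥ ((k - 1)n + 1)/k, restricts attention to t < n (the case t = n is treated separately in the proof), and numbers processes 0, ..., n - 1; the function name `lemma_partition` is hypothetical.

```python
def lemma_partition(n, k, t):
    """Partition processes 0..n-1 into k groups as in the proof sketch.

    Assumes 2 <= k <= n-1, 1 <= t < n, and k*t >= (k-1)*n + 1,
    which is t >= ((k-1)n + 1)/k written without fractions.
    The first k-1 groups have exactly n-t members; the last group
    takes all remaining processes (at least n-t+1 of them).
    """
    if not (2 <= k <= n - 1 and 1 <= t < n):
        raise ValueError("need 2 <= k <= n-1 and 1 <= t < n")
    if k * t < (k - 1) * n + 1:
        raise ValueError("threshold t >= ((k-1)n + 1)/k not met")
    size = n - t
    groups = [list(range(i * size, (i + 1) * size)) for i in range(k - 1)]
    groups.append(list(range((k - 1) * size, n)))
    return groups
```

With n = 10, k = 3 and t = 7 the groups have sizes 3, 3 and 4, so the last group can lose one process and still contain n - t = 3 correct ones, which is what the reduction to [43] relies on.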


Lemma 11.2.2 In the crash model, there is no protocol for KC(k,t,WV1), for t ≥ k.

Proof: For a contradiction assume that there exists such a protocol A. We claim that A can be
used to solve KC(k,t,RV1) for t ≥ k. To see why, consider any run a in which f ≤ t processes are
faulty and let g be the set of correct processes and g' be the set of faulty processes. Now consider
a run a' that is identical to a except that all processes are correct and any message sent by any
p ∈ g' in a' after the time that p failed in a is delayed until after all processes in g decide. That
is, for each p_i ∈ g and each p_j ∈ g', p_i receives a message from p_j at time T in a' iff p_i receives the
same message at time T from p_j in a. By the validity condition WV1, each process decides on some
process' input in a'. Clearly, processes in g cannot distinguish between a and a'. Hence, processes
in g decide the same value in a as they decide in a', and so validity RV1 is satisfied in a. In other
words, protocol A solves KC(k,t,RV1) for t ≥ k, contradicting Lemma 11.1.2. □


Lemma 11.2.3 In the crash model, there is no protocol for KC(k,t,SV1).

Proof: For a contradiction assume that there exists such a protocol A. Let a be an execution of A
in which all processes are correct and they all start with different values. Let v be a decision made by
at least two processes (there is always such a decision since k < n). Because of validity SV1, v is the
input of some process p_i and since all inputs are different only p_i has v as input. Now consider the
run a' that is the same as a except that process p_i fails right after sending its last message. Clearly
a and a' are indistinguishable and thus each process (maybe with the exception of p_i) makes the
same decision in both runs. Hence in a' value v is decided by at least one process p_j, j ≠ i. But
only p_i has v as input and p_i is not correct in a', and so validity SV1 is violated. □


Lemma 11.2.4 In the crash model, there is no protocol for KC(k,t,SV2), for t ≥ kn/(2k + 1).

Proof: For a contradiction assume that there exists such a protocol A. Consider first the case
t ≥ n/2. Partition the system into two non-intersecting sets of processes, g, g', each containing at
least n - t processes (e.g., |g| = |g'| = n/2). This is always possible because t ≥ n/2. Let a be a
run of A in which all processes are correct, all start with different initial values denoted v_1, v_2, ..., v_n,
and all communication between g and g' is delayed until after the decisions are made. We claim
that n values are decided in a. To see this, fix any process p_i ∈ g, and consider the following run
a_i. The processes in g start with the same values as in a, and all except p_i crash after p_i reaches a
decision. All the processes in g' start with v_i but communication between g and g' is delayed until
after p_i makes a decision. By SV2, p_i must decide v_i in a_i, and by indistinguishability of a from a_i,
p_i must decide v_i in a. Similarly, runs a_i can be constructed for every process p_i ∈ g', and hence
all processes must decide their own values in a. This contradicts the hypothesis that A solves the
problem (for k < n).

Now consider the case t < n/2. In this case, n - 2t > 0 and the condition t ≥ kn/(2k + 1) is equivalent
to k ≤ (n - t)/(n - 2t) - 1. Let g be a subset of the system containing n - t processes, and let
g_1, ..., g_m, with m = ⌊(n - t)/(n - 2t)⌋,
be a partition of g into disjoint sets of size at least n - 2t each. Let a be a run of A in which all the
processes are correct, communication between g and the rest of the system is delayed until after all
processes have decided and, for each i, processes in g_i start with a distinct value v_i. Fix i, and let
p_i ∈ g_i be some process. Consider a run a_i of A as follows: Processes in g_i are correct, all processes
in g \ g_i are faulty, and crash after p_i decides. All communication between g and the rest of the
system is delayed until after p_i decides. By SV2, p_i must decide v_i, but since a is indistinguishable
to p_i from a_i, p_i must decide v_i in a. Therefore, in a, at least ⌊(n - t)/(n - 2t)⌋ different values are decided
on. This contradicts the hypothesis that A solves the problem since k ≤ (n - t)/(n - 2t) - 1 < ⌊(n - t)/(n - 2t)⌋. □
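
The equivalence used in the second case, between t ≥ kn/(2k + 1) and k ≤ (n - t)/(n - 2t) - 1 when n - 2t > 0, can be checked by brute force over small parameters. The sketch below assumes these are the intended forms of the two conditions; the function names are hypothetical, and both inequalities are cross-multiplied so that only integer arithmetic is used.

```python
def threshold_holds(n, k, t):
    """t >= kn/(2k+1), cross-multiplied to stay in integers."""
    return (2 * k + 1) * t >= k * n


def group_bound_holds(n, k, t):
    """k <= (n-t)/(n-2t) - 1, cross-multiplied; only meaningful when n - 2t > 0."""
    return (k + 1) * (n - 2 * t) <= n - t


def forms_agree(max_n):
    """Compare the two forms for all 3 <= n <= max_n, 2 <= k <= n-1, 1 <= t < n/2."""
    return all(
        threshold_holds(n, k, t) == group_bound_holds(n, k, t)
        for n in range(3, max_n + 1)
        for k in range(2, n)
        for t in range(1, n)
        if n - 2 * t > 0
    )


def groups_available(n, t):
    """Number of disjoint groups of size n - 2t that fit into n - t processes."""
    return (n - t) // (n - 2 * t)
```

For instance, with n = 64, k = 2 and t = 26 the threshold holds and three groups of size 12 fit into the 38 processes of g, one more than k.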


11.3 Protocols 


In this section we provide two protocols for the crash model. 


PROTOCOL A: Each process broadcasts its input and waits for n - t messages. If all n - t
messages contain the same value v, then the process decides v, else it decides a default
value v_0.

Lemma 11.3.1 PROTOCOL A solves KC(k,t,RV2) in the crash model for t < (k - 1)n/k.


Proof: We start by proving termination. The number of actual failures is less than or equal to t. Hence
there are at least n - t correct processes. Thus each correct process eventually receives at least n - t
messages and is able to make a decision.

Now we prove agreement. For the sake of contradiction assume that k + 1 values are decided.
One of them could be the default value, but at least k values, different from the default value, are
decided. By the protocol it is necessary that there be k disjoint sets g_1, g_2, ..., g_k, each consisting of
at least n - t processes such that each process in g_i sends a value v_i (with v_i ≠ v_j for i ≠ j). Hence
there must be at least k(n - t) processes. However since t < (k - 1)n/k we have that n - t > n/k and
that k(n - t) > n, which implies that there must be more than n processes. This is impossible since
we have n processes.

Finally we prove validity. Assume that all processes start with value v. Clearly a process cannot
receive two different values since v is the only value being sent. Hence by the protocol each process
that makes a decision decides v. □
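
The decision rule of PROTOCOL A and the counting step of the agreement argument can be sketched as follows. This is a local view only (no networking); the names `protocol_a_decide` and `max_non_default_values`, and the string "v0" standing in for the default value, are assumptions made for illustration.

```python
def protocol_a_decide(received, default="v0"):
    """Decision rule of PROTOCOL A for one process.

    received: the n - t values the process waited for (non-empty list).
    Returns the common value if all received values agree, else the default.
    """
    first = received[0]
    return first if all(v == first for v in received) else default


def max_non_default_values(n, t):
    """Upper bound on distinct non-default decisions.

    Each non-default decision needs its own disjoint group of n - t senders,
    and at most floor(n / (n - t)) such groups fit into n processes.
    Assumes t < n.
    """
    return n // (n - t)
```

Under the assumed bound t < (k - 1)n/k, `max_non_default_values(n, t)` is at most k - 1, so together with the default value at most k values are decided.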


PROTOCOL B: Each process broadcasts its input and waits for n - t messages. One of
these n - t messages is the process' own message. If n - 2t messages contain the same
value as its own, say v, the process decides v, else it decides a default value v_0.

Lemma 11.3.2 PROTOCOL B solves KC(k,t,SV2) in the crash model for t < (k - 1)n/(2k).


Proof: We start by proving termination. The number of actual failures is less than or equal to t. Hence
there are at least n - t correct processes. Thus each correct process eventually receives at least n - t
messages and is able to make a decision.

Now we prove agreement. For the sake of contradiction assume that k + 1 values are decided.
One of them could be the default value, but at least k values, different from the default value, are
decided. By the protocol it is necessary that there be k disjoint sets g_1, g_2, ..., g_k, each consisting of
at least n - 2t processes such that each process in g_i sends a value v_i (with v_i ≠ v_j for i ≠ j). Hence
there must be at least k(n - 2t) processes. However since t < (k - 1)n/(2k) we have that k(n - 2t) > n, which
implies that there must be more than n processes. This is impossible since we have n processes.

Finally we prove validity. Assume that all correct processes start with value v. We have to prove
that a correct process decides v. Let p be a correct process. First we observe that since p starts
with v it decides v or v_0. Hence it suffices to prove that p receives at least n - 2t messages with
v. Among the n - t messages p receives, at least n - 2t are from correct processes. Hence process p
receives at least n - 2t messages with v. □
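
A minimal sketch of the decision rule of PROTOCOL B, again from the local view of a single process. It reads "n - 2t messages contain the same value as its own" as "at least n - 2t"; that reading, the function name and the "v0" placeholder are assumptions.

```python
def protocol_b_decide(own, received, n, t, default="v0"):
    """Decision rule of PROTOCOL B for one process.

    own:      the process' own input
    received: the n - t values it waited for, its own message included
    Decides `own` if at least n - 2t received values equal it, else the default.
    """
    matches = sum(1 for v in received if v == own)
    return own if matches >= n - 2 * t else default
```

With n = 7 and t = 2, a process with input "a" that receives ["a", "a", "a", "b", "c"] sees n - 2t = 3 matching values and decides "a"; with only two matching values it falls back to the default.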


11.4 Remarks 


For KC(RV2) and KC(WV2), there is a very tiny gap between our possibility and impossibility results
(Lemmas 11.2.1 and 11.3.1), formed by the cases where n is a multiple of k. These are isolated points
on the line that separates possible from impossible. Since for all other points on this line the problem
is not solvable it would be very surprising if for those isolated points the problem were solvable. For
KC(SV2) there is also a small gap between our possibility and impossibility results (Lemmas 11.2.4
and 11.3.2).
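
The isolated open points for KC(RV2) and KC(WV2) can be enumerated directly. The sketch below assumes the two bounds read t < (k - 1)n/k (Lemma 11.3.1) and t ≥ ((k - 1)n + 1)/k (Lemma 11.2.1); the function name is hypothetical.

```python
def open_points_rv2_wv2(n):
    """List the (k, t) pairs covered by neither bound, for 2 <= k <= n-1.

    solvable:   k*t <  (k-1)*n      (t <  (k-1)n/k)
    impossible: k*t >= (k-1)*n + 1  (t >= ((k-1)n + 1)/k)
    What remains is k*t == (k-1)*n, which needs k to divide n.
    """
    points = []
    for k in range(2, n):
        for t in range(1, n + 1):
            solvable = k * t < (k - 1) * n
            impossible = k * t >= (k - 1) * n + 1
            if not solvable and not impossible:
                points.append((k, t))
    return points
```

For n = 64, the scale used in the figures, this yields exactly the five points (2, 32), (4, 48), (8, 56), (16, 60) and (32, 62), one for each divisor k of 64 in the range considered.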


Chapter 12


Byzantine failures 


In this section we consider the Byzantine failures model. In Section 12.1 we are concerned with
impossibilities and in Section 12.2 we provide protocols. Figure 12-1 shows a graphical representation
of the results.


12.1 Impossibilities


In this section we provide impossibility results for the Byzantine model. Clearly the impossibilities
proved for the crash model still hold. In particular the impossibilities for KC(SV1) and KC(WV1)
are directly derived from the corresponding ones for the crash model. Next we provide additional
impossibilities.


Lemma 12.1.1 In the Byzantine model, there is no protocol that solves KC(k,t,WV2), for t ≥ kn/(2k + 1)
and t ≥ k.

Proof: For a contradiction assume that such a protocol A exists. We distinguish two cases: (i)
t ≥ n/2 and (ii) t < n/2.

Consider case (i). Let v_1, v_2, ..., v_{t+1} be t + 1 different values. Let a be the following run of A.
The number of actual failures in a is f = n - t - 1. Let F be the set of faulty processes and let
p_1, ..., p_{t+1} be the correct processes. Process p_i has input v_i, for i = 1, 2, ..., t + 1. Messages between
any two correct processes are delayed until all correct processes decide, that is, correct processes
communicate only with processes in F.

We now show that at least k + 1 values are decided in a, which contradicts the hypothesis that A
solves the problem. For each i = 1, 2, ..., t + 1 consider the following run a_i. All processes are correct,
all have input v_i, messages between processes not belonging to F are delayed until all processes not
in F decide. By validity WV2, we have that in a_i all processes must decide v_i. Process p_i, for
i = 1, 2, ..., t + 1, cannot distinguish between a and a_i, if in a the members of F behave as if they
were correct and had v_i initially. Hence p_i has to decide the same value in both runs. We have
that process p_i decides v_i also in a. Since v_1, v_2, ..., v_{t+1} are different, we have that t + 1 values are
decided in a. But t ≥ k, hence at least k + 1 values are decided in a.

Figure 12-1: Byzantine model. n = 64.

Consider case (ii). Since t < n/2 we have that n - 2t > 0 and thus the condition t ≥ kn/(2k + 1)
is equivalent to (n - t)/(n - 2t) ≥ k + 1. Then, we can partition the processes into k + 2 groups, the first
k + 1 of which, denoted g_1, g_2, ..., g_{k+1}, each consists of at least n - 2t processes, and the last of
which, denoted F, consists of t processes. Let a be the following run of A. Let v_1, v_2, ..., v_{k+1} be
k + 1 different values. Processes in g_i start with v_i, for i = 1, 2, ..., k + 1, and processes in F are
faulty. Processes in group g_i communicate only within g_i and with processes in F. For each group
g_i processes in F behave as correct processes with input v_i.

We now show that at least k + 1 values are decided in a, which contradicts the hypothesis that A
solves the problem. For each i = 1, 2, ..., k + 1 consider the following run a_i. All processes are correct,
all have input v_i, processes in group g_i communicate only within g_i and with processes in F. By
validity WV2, we have that in a_i all processes must decide v_i. Processes in g_i, for i = 1, 2, ..., k + 1,
cannot distinguish between a and a_i. Hence they have to decide the same value in both runs, and
so processes in g_i decide v_i also in a. Since v_1, v_2, ..., v_{k+1} are different, we have that k + 1 values
are decided in a. □
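
The partition used in case (ii) can be built explicitly. The sketch assumes t < n/2 and t ≥ kn/(2k + 1), numbers processes 0, ..., n - 1, and places the faulty set F at the end; the function name `byzantine_partition` is hypothetical.

```python
def byzantine_partition(n, k, t):
    """Return (groups, faulty): k+1 groups of at least n-2t processes and a set F of t processes.

    Assumes 2t < n and (2k+1)*t >= k*n, i.e. t >= kn/(2k+1) without fractions.
    The first k groups have exactly n-2t members; the last group takes what is
    left of the n-t non-faulty processes, which is at least n-2t under the assumption.
    """
    if not (2 * t < n and (2 * k + 1) * t >= k * n):
        raise ValueError("need t < n/2 and t >= kn/(2k+1)")
    size = n - 2 * t
    faulty = list(range(n - t, n))
    groups = [list(range(i * size, (i + 1) * size)) for i in range(k)]
    groups.append(list(range(k * size, n - t)))
    return groups, faulty
```

For n = 64, k = 2 and t = 26 this gives three groups of sizes 12, 12 and 14 plus 26 faulty processes, so each group together with F contains at least n - t = 38 processes.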


Lemma 12.1.2 In the Byzantine model, there is no protocol that solves KC(k,t,RV1).

Proof: For a contradiction assume that such a protocol A exists. Let a_1 be a run of A in which all
processes are correct and each starts with a different input value. Let v_1, ..., v_z be the set of values
decided by correct processes. Because A satisfies validity RV1, each of the v_i is the input of some
process. Since z ≤ k < n, we have that there exists a value v_i, 1 ≤ i ≤ z, decided by at least two
processes, say p_1 and p_2.

Let process q be the process whose input in a_1 is v_i for some i ∈ {1, ..., z}. Use A in the run a_2
in which q is faulty but behaves as in a_1, claiming that v_i is its input, but it has v'_i as its input,
with v'_i different from v_i and also from any other input. Since correct processes cannot distinguish
between a_1 and a_2 they have to decide on the same value. We now distinguish three possible cases:
(1) q is different from both p_1 and p_2, (2) q is p_1 and (3) q is p_2. If q is different from both p_1 and p_2
then both p_1 and p_2 are correct and thus they decide on v_i in a_2. However v_i is not an input value
in a_2. Hence validity is violated. If q is p_1 (resp. p_2) then p_2 (resp. p_1) is correct and thus decides
v_i in a_2. However v_i is not an input value in a_2. Hence validity RV1 is violated. This contradicts
the hypothesis that A solves KC(k,t,RV1). □


Lemma 12.1.3 In the Byzantine model, there is no protocol for KC(k,t,RV2), for t ≥ kn/(2(k + 1)).

Proof: The proof is similar to that for Lemma 11.2.4. For a contradiction assume that such a
protocol A exists. We distinguish two cases: (i) t < n/2 and (ii) t ≥ n/2. Consider case (i). Since
t < n/2 we have that n - 2t > 0 and thus the condition t ≥ kn/(2(k + 1)) is equivalent to n/(n - 2t) ≥ k + 1.
Then, we can partition the processes in k + 1 groups each consisting of at least n - 2t processes.
Consider case (ii). In this case we partition the processes in k + 1 groups each consisting of at least
one process.

In both cases, let g_1, g_2, ..., g_k, g_{k+1} be the k + 1 groups of processes. Let v_1, ..., v_{k+1} be k + 1
different values and consider the following run a. All processes are correct, processes in group g_i
start with v_i. For each group g_i, there is a set of t processes not belonging to g_i, call it F_i, such
that, for each i, communication is allowed only among processes in g_i and F_i until all processes have
decided. Notice that the cardinality of g_i ∪ F_i is at least n - t in both cases.

We now show that k + 1 values are decided in a, which contradicts the hypothesis that A solves
the problem. Fix i, 1 ≤ i ≤ k + 1, and consider run a_i. There are exactly t faulty processes and these
processes are those in F_i. Processes in g_i are correct. All processes start with v_i. Faulty processes
behave exactly as they do in run a. Processes in g_i communicate only with other processes in g_i and
F_i. We can use A to solve KC(k,t,RV2), and by the validity RV2 we have that all correct processes,
and in particular those in g_i, decide v_i. Processes in g_i cannot distinguish run a and run a_i. Hence,
since they decide v_i in a_i they have to decide v_i also in a. It follows that k + 1 values are decided
in a. □


12.2 Protocols 


In this section we provide protocols for the Byzantine model. We start by observing that PROTOCOL A,
used for the crash model, solves KC(WV2) also in the Byzantine model, though only for a
restricted range of values of k and t.

Lemma 12.2.1 PROTOCOL A solves KC(k,t,WV2) in the Byzantine model for t < n/2 and
k ≥ (n - t)/(n - 2t) + 1.


Proof: We start by proving termination. Since there are at most t failures, correct processes are
guaranteed to receive at least n - t messages and thus they decide.

Next we prove agreement. To have a bound on the number of possible decisions we look at
how many values different from the default value can be decided. Let f be the number of actual
failures. We have that any group of n - t - f correct processes that start with the same value can
be forced by the f faulty processes to decide that value. Notice that since f ≤ t < n/2 we have
that n - t - f ≥ 1. Hence the number of decisions can be as big as the number of possible disjoint
groups of n - t - f correct processes, plus one to take into account the default value. There can
be at most (n - f)/(n - t - f) such groups. This function is an increasing function of f and thus
it achieves its maximum value for f = t. Hence the number of different decisions we can have is at
most (n - t)/(n - 2t) + 1. Since k ≥ (n - t)/(n - 2t) + 1 agreement is satisfied.

Finally we prove validity. Assume that all processes are correct and start with v. Then clearly
v is the only decision. □
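
The monotonicity claim in the agreement argument, that (n - f)/(n - t - f) grows with f and peaks at f = t, can be checked with exact rational arithmetic. The helper names are hypothetical, and the sketch assumes t < n/2 so that every denominator is positive.

```python
from fractions import Fraction


def decision_bound(n, t, f):
    """(n - f)/(n - t - f) + 1, the bound from the agreement argument, as an exact fraction."""
    return Fraction(n - f, n - t - f) + 1


def bound_is_monotone(n, t):
    """True if the bound never decreases as f ranges over 0, 1, ..., t."""
    values = [decision_bound(n, t, f) for f in range(t + 1)]
    return all(a <= b for a, b in zip(values, values[1:]))
```

The reason is visible once the fraction is rewritten as 1 + t/(n - t - f): increasing f shrinks the denominator. At f = t the bound equals (n - t)/(n - 2t) + 1, matching the condition on k in the lemma.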


Lemma 12.2.2 PROTOCOL A solves KC(k,t,WV2) in the Byzantine model for t ≥ n/2 and k ≥ t + 1.

Proof: Termination and validity are as in the previous lemma. Next we prove agreement. Let f be
the number of actual failures. We distinguish two cases: (i) f ≤ n - t - 2 and (ii) f > n - t - 2.
In case (i) we have that for any n - t messages received by a process, at least two of them are sent
by correct processes. Hence for each different value v ≠ v_0 decided by some process at least two
correct processes have sent that value. Hence no more than n/2 values different from the default
value v_0 can be decided. Hence at most n/2 + 1 different values can be decided in case (i). In case
(ii) the number of correct processes is strictly less than t + 2. Hence we cannot have more than t + 1
different decisions. Putting together the two cases, we have that the number of different decisions
is at most max{n/2 + 1, t + 1} = t + 1 ≤ k. □


Next we provide a generalized version of the “echo” protocol of Bracha and Toueg [22], which we
call ℓ-echo, where ℓ ≥ 2. (The 1-echo protocol is Bracha and Toueg's echo protocol.) The ℓ-echo
protocols will be used to provide a family of protocols for KC(SV2).

ℓ-echo protocol: To ℓ-echo broadcast a message m, the sender s sends the message
(init,s,m) to all other processes. When a process p receives the first (init,s,m) from s,
it sends the message (echo,s,m) to all other processes. Subsequent init messages from
s are ignored. If process p receives message (echo,s,m) from more than (n + ℓt)/(ℓ + 1)
processes, then process p accepts message m as sent by the sender process s.
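
The acceptance rule of the echo protocol can be sketched from the receiver's side. The class below only counts echoes and applies the threshold "more than (n + ℓt)/(ℓ + 1) processes"; message transport and the init/echo relay are left out. The class name `EchoReceiver`, its method names, and the decision to count each echoing process once per (sender, message) pair are assumptions made for illustration.

```python
from collections import defaultdict


class EchoReceiver:
    """Tracks echo messages at one process and applies the acceptance threshold."""

    def __init__(self, n, t, ell):
        self.n, self.t, self.ell = n, t, ell
        # (sender, message) -> set of processes from which an echo was received
        self.echoers = defaultdict(set)
        self.accepted = set()

    def on_echo(self, frm, sender, message):
        """Record an echo from `frm`; return True once (sender, message) is accepted."""
        key = (sender, message)
        self.echoers[key].add(frm)
        # count > (n + ell*t)/(ell + 1), cross-multiplied to avoid fractions
        if (self.ell + 1) * len(self.echoers[key]) > self.n + self.ell * self.t:
            self.accepted.add(key)
        return key in self.accepted
```

With n = 7, t = 2 and ℓ = 1 the threshold is 4.5, so a message is accepted on the fifth distinct echo; repeated echoes from the same process do not count twice.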


Lemma 12.2.3 In a system with t < n/(2£4+1), if a sender s uses the £-echo protocol to send a 


message m then: 
(i) Correct processes accept at most € different messages. 


(ii) If s is correct, every correct process accepts m. 


Proof: First we prove (i). For the sake of contradiction assume that correct processes accept ℓ + 1
different messages m_1, m_2, ..., m_{ℓ+1}. Then there must be ℓ + 1 correct processes, say p_1, p_2, ..., p_{ℓ+1},
such that process p_i receives more than (n + ℓt)/(ℓ + 1) echos with m_i, for each i = 1, 2, ..., ℓ + 1. Thus
there must be a total of more than n + ℓt echos sent for the messages m_1, m_2, ..., m_{ℓ+1}. Let f be the
actual number of faulty processes. Since a faulty process can send ℓ + 1 different echos (it can echo m_1
to p_1, m_2 to p_2 and so on) we have that strictly more than n + ℓt - (ℓ + 1)f ≥ n + ℓf - (ℓ + 1)f = n - f
echos are sent by correct processes. This implies that at least one correct process sent two different
echos, which is not possible.

Now we prove (ii). If the sender is correct, then it sends an init message for m to all other
processes. Any correct process will receive this and broadcast an echo message for m. Because there
are at most t < (n + ℓt)/(ℓ + 1) faulty processes, no correct process accepts any message other than
m. Since there are at least n - t correct processes, it is sufficient that n - t be strictly greater than
(n + ℓt)/(ℓ + 1) in order to guarantee that any correct process receives enough echo messages to be
able to accept m. Since t < ℓn/(2ℓ + 1) we have that n - t > (n + ℓt)/(ℓ + 1). □
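
The final step of part (ii) can be spelled out as a short rearrangement. The derivation below is a worked check, assuming only that ℓ + 1 > 0 so that multiplying both sides preserves the inequality.

```latex
\begin{align*}
n - t > \frac{n + \ell t}{\ell + 1}
  &\iff (\ell + 1)(n - t) > n + \ell t \\
  &\iff \ell n > (2\ell + 1)\, t \\
  &\iff t < \frac{\ell n}{2\ell + 1}.
\end{align*}
```

For ℓ = 1 the condition reads t < n/3.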


The ℓ-echo protocol is used to define a family of protocols for KC(k,t,SV2) as follows.


PROTOCOL C(ℓ): Each process broadcasts its input using the ℓ-echo protocol and waits
for n - t messages to be accepted, where one of these n - t messages is the process' own
message. If n - 2t messages contain the same value v, then the process decides v, else it
decides a default value v_0.
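
The decision step of this protocol can be sketched as below. The function name `decide` and its signature are hypothetical, and the sketch makes two labeled assumptions: "n - 2t messages contain the same value" is read as "at least n - 2t", and if more than one value reaches that count the most frequent one is chosen.

```python
from collections import Counter


def decide(accepted_values, n, t, default):
    """Decision step applied to the n - t accepted values (illustrative sketch)."""
    if len(accepted_values) != n - t:
        raise ValueError("expected exactly n - t accepted values")
    # Most frequent accepted value and how many accepted messages carry it.
    value, count = Counter(accepted_values).most_common(1)[0]
    # Assumption: the rule is read as "at least n - 2t messages carry the value".
    return value if count >= n - 2 * t else default
```

With n = 7 and t = 2, a process waits for five accepted values and decides v only if at least three of them equal v.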


Lemma 12.2.4 PROTOCOL C(ℓ) solves KC(k,t,SV2) in the Byzantine model for t < (k - 1)n/(2k + ℓ - 1) and
t < ℓn/(2ℓ + 1).


Proof: We start by proving termination. Since there are at least n - t correct processes, each correct
process eventually accepts at least n - t messages broadcast by ℓ-echo and is able to make a decision.

Now we prove agreement. For a contradiction assume that k + 1 values are decided. One of them
could be the default value, but at least k values, different from the default value, are decided. By the
protocol it is necessary that there be k sets g_1, g_2, ..., g_k, each consisting of at least n - 2t processes,
such that some correct process accepts a value v_i from each process in g_i (with v_i ≠ v_j for i ≠ j).
Hence correct processes accept at least k(n - 2t) values broadcast by ℓ-echo. Each faulty process can
contribute ℓ different values, and so the number of different senders is at least k(n - 2t) - (ℓ - 1)t.
However since t < (k - 1)n/(2k + ℓ - 1) we have that k > (n + (ℓ - 1)t)/(n - 2t) and thus
k(n - 2t) - (ℓ - 1)t > n, which implies that there must be more than n processes, a contradiction.

Finally we prove validity. Assume that all correct processes start with value v. We have to prove
that a correct process decides v. Let p be a correct process. First we observe that since p starts with
v it either decides v or v_0. Hence it suffices to prove that p receives at least n - 2t messages with
v. Among the n - t messages p receives, at least n - 2t are from correct processes. Hence process p
receives at least n - 2t messages with v. □
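
The counting inequality in the agreement argument, k(n - 2t) - (ℓ - 1)t > n, can be solved for t, which gives t < (k - 1)n/(2k + ℓ - 1). The sketch below checks that equivalence numerically and computes the largest tolerable t once the ℓ-echo condition t < ℓn/(2ℓ + 1) is also imposed. The function names are hypothetical, and the closed-form bound is a reading reconstructed from the inequality rather than a quoted formula.

```python
from fractions import Fraction


def agreement_holds(n: int, t: int, k: int, ell: int) -> bool:
    """Counting inequality from the agreement argument."""
    return k * (n - 2 * t) - (ell - 1) * t > n


def t_bound(n: int, k: int, ell: int) -> Fraction:
    """Strict upper bound on t obtained by solving the inequality above for t."""
    return Fraction((k - 1) * n, 2 * k + ell - 1)


def max_t(n: int, k: int, ell: int) -> int:
    """Largest integer t below both t_bound and the echo bound ell*n/(2*ell + 1)."""
    echo_bound = Fraction(ell * n, 2 * ell + 1)
    limit = min(t_bound(n, k, ell), echo_bound)
    # Largest integer strictly below the rational `limit`.
    return (limit.numerator - 1) // limit.denominator
```

For ℓ = 1 the two bounds reduce to t < n/4 when k = 2 and t < n/3 when k ≥ 3.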


Finally we provide a protocol for KC(WV1).


PROTOCOL D: Processes p_1, p_2, ..., p_{t+1} each broadcasts its input value. A process that
receives a value v_i from p_i, i ∈ {1, 2, ..., t+1}, broadcasts an (echo,v_i,p_i) message and
never echos a value for p_i again. Processes p_1, p_2, ..., p_{t+1} each decides on its own value.
Every other process decides the first value v_i, i ∈ {1, ..., t+1}, for which it receives
identical echos (echo,v_i,p_i) from n - t processes.


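In PROTOCOL D, we say that a process accepts a value v_i from p_i if it receives identical echos for v_i
from at least n - t processes. We define the following functions

$$V(n,t,f) = \begin{cases} n - f & \text{if } n - t - f \le 0 \\ t + 1 - f + f \left\lfloor \dfrac{n - f}{n - t - f} \right\rfloor & \text{if } n - t - f > 0 \end{cases}$$

and

$$Z(n,t) = \max_{0 \le f \le t} \min\{V(n,t,f),\; n - f\}.$$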


Lemma 12.2.5 PROTOCOL D solves KC(k,t,WV1) in the Byzantine model for k ≥ Z(n,t).


Proof: We start by proving termination. At least one process among p_1, ..., p_{t+1} is correct, and at
least n - t processes receive its value and echo it. Hence it is guaranteed that each correct process
receives at least one set of n - t identical echo messages and thus is able to decide.

Next we prove validity. Assume that there are no failures. Then all processes are correct and
thus the values accepted by any process are input values. All decisions are one of the accepted
values. Hence validity WV1 is satisfied.

Finally we prove agreement. We compute an upper bound on the number of different decisions
for each possible value of f, that is, the number of actual failures. By definition, 0 ≤ f ≤ t. We
distinguish two cases: (i) n - t - f ≤ 0 and (ii) n - t - f > 0. In case (i) a correct process may
be forced to communicate only with faulty processes. In this case we simply bound the number of
decisions by the number of correct processes, that is n - f. In case (ii) the total number of values
that correct processes accept from one faulty process is bounded by ⌊(n - f)/(n - t - f)⌋. Indeed, a correct
process accepts a value when receiving at least n - t echos, at least n - t - f of which are from
correct processes, and each correct process echoes at most one value for a given sender. Thus the
total number of values from p_1, ..., p_{t+1} accepted by correct processes
is at most (t + 1 - f) + f⌊(n - f)/(n - t - f)⌋; that is, the number of values sent by correct processes plus the
number of values that correct processes may be forced to accept because of the Byzantine behavior
of faulty processes. Hence the number of different decisions that we can have is
t + 1 - f + f⌊(n - f)/(n - t - f)⌋.
It is possible that this bound is bigger than n - f. In such a case, we can bound the number of
different decisions by n - f. Summarizing the two cases we have that for any f, we bound the
number of decisions by n - f if n - t - f ≤ 0 and by min{t + 1 - f + f⌊(n - f)/(n - t - f)⌋, n - f} if n - t - f > 0.
The maximum over all possible values of f is given by Z(n,t). Hence we have that the number of
decisions is always at most Z(n,t), as required. □
We note that when t < n/3, ⌊(n - f)/(n - t - f)⌋ = 1 for all 0 ≤ f ≤ t, and therefore, the protocol above
guarantees agreement for any k > t (see Figure 12-1).
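
The bound Z(n,t) is easy to evaluate numerically. The sketch below reads V and Z as they are used in the agreement proof, namely V(n,t,f) = n - f when n - t - f ≤ 0 and t + 1 - f + f⌊(n - f)/(n - t - f)⌋ otherwise, with Z(n,t) the maximum over 0 ≤ f ≤ t of min{V(n,t,f), n - f}. That reading, and the Python function names, are assumptions made for illustration.

```python
def V(n: int, t: int, f: int) -> int:
    """Bound on the number of decisions when exactly f processes are faulty."""
    if n - t - f <= 0:
        return n - f
    # Integer division implements the floor for positive operands.
    return t + 1 - f + f * ((n - f) // (n - t - f))


def Z(n: int, t: int) -> int:
    """Worst case over 0 <= f <= t, capped by the number of correct processes."""
    return max(min(V(n, t, f), n - f) for f in range(t + 1))
```

For example, Z(10, 4) evaluates to 7, while for every t < n/3 the function returns t + 1, matching the remark above that agreement holds for any k > t in that range.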


12.3. Remarks 


For the Byzantine model, the impossibility results and protocols we have provided in this section
leave a small gap for the KC problem defined with validities WV2, RV2 and SV2 and a substantial
gap for KC(WV1). An interesting open problem is to fill in these gaps.
Chapter 13


Conclusions 


The k-set consensus problem is an abstraction of many coordination problems in a distributed system
that can suffer process failures. In this thesis we have investigated the k-set consensus problem in
asynchronous message-passing distributed systems. We have extended previous work by exploring
several variations of the problem definition and model, including, for the first time, an investigation of
Byzantine failures. We have shown that the precise definition of the validity requirement, which
characterizes what decision values are allowed as a function of the input values and whether failures
occur, is crucial to the solvability of the problem. For example, we have shown that allowing default
decisions in case of failures makes the problem solvable for most values of k in the face of minority
failures, even with the most severe type of failures (Byzantine). We have introduced six validity
conditions for this problem (all considered in various contexts in the literature), and have demarcated the
line between possible and impossible for each case. In many cases this line is different from the one
of the originally studied k-set consensus problem.


In this thesis we have considered asynchronous systems. A natural question to ask is: what 
happens in synchronous systems? Clearly any algorithm that works in asynchronous systems works 
also in synchronous systems. 

Let us first consider the case of stop failures. The FloodSet algorithm (see for example [65, Ch.
7]) solves the KC problem in synchronous systems with stop failures. It tolerates any number of
failures, that is, it works for any t < n. The validity condition considered is validity RV1. Hence this
algorithm works also for validities RV2, WV1 and WV2. The impossibility proof for KC with validity SV1
that we have provided for asynchronous systems works also for synchronous systems. Hence there is
no KC protocol for validity SV1 in synchronous systems with stop failures. The above covers almost
all the cases we have considered. The only open case is validity SV2: we can use the algorithm for
asynchronous systems, which solves the problem for t < n/4 when k = 2 and for t < n/3 when k ≥ 3. For
the other cases we do not know whether the problem is solvable.
For the Byzantine case there is no work on KC for synchronous systems. It is known that
KC(1,SV2) can be solved if and only if t < n/3 [78, 62]. The EIGbyz algorithm (see [65]) provides
a solution to KC(1,SV2) when t < n/3. Lamport [60] proved that KC(1,t,WV2) can also be solved
if and only if t < n/3. Clearly EIGbyz also solves KC(1,WV2). No results for KC(k), k ≥ 2, are
known for synchronous systems with Byzantine failures. Obviously one can use the algorithms provided
in this thesis for asynchronous systems. However this still leaves large gaps for the values of k
and t for which we do not know whether the problem is solvable.


Another natural question to ask is what happens in shared memory systems. Algorithms that
work for message-passing systems also work for shared memory systems because a channel can be
simulated with shared memory. The FLP impossibility proof can be generalized to shared memory
([64]). The impossibility result of [20, 52, 84] also holds for shared memory. In some of the
impossibility proofs we provided in this thesis we used the fact that the system is message-passing;
hence we do not know whether the impossibility results still hold in the shared memory setting.
We conjecture that the techniques used in this thesis can be used to provide a similar analysis for
the shared memory models.